Preface


The subject of this textbook is the analysis of Boolean functions. Roughly 
speaking, this refers to studying Boolean functions $f : \{0,1\}^n \to \{0,1\}$ via
their Fourier expansion and other analytic means. Boolean functions are perhaps 
the most basic object of study in theoretical computer science, and Fourier 
analysis has become an indispensable tool in the field. The topic has also played 
a key role in several other areas of mathematics, from combinatorics, random 
graph theory, and statistical physics, to Gaussian geometry, metric/Banach 
spaces, and social choice theory. 

The intent of this book is both to develop the foundations of the field and 
to give a wide (though far from exhaustive) overview of its applications. Each 
chapter ends with a “highlight” showing the power of analysis of Boolean func- 
tions in different subject areas: property testing, social choice, cryptography, 
circuit complexity, learning theory, pseudorandomness, hardness of approxi- 
mation, concrete complexity, and random graph theory. 

The book can be used as a reference for working researchers or as the basis 
of a one-semester graduate-level course. The author has twice taught such a 
course at Carnegie Mellon University, attended mainly by graduate students 
in computer science and mathematics but also by advanced undergraduates, 
postdocs, and researchers in adjacent fields. In both years most of Chap- 
ters 1-5 and 7 were covered, along with parts of Chapters 6, 8, 9, and 11, 
and some additional material on additive combinatorics. Nearly 500 exercises 
are provided at the ends of the book’s chapters. 

Additional material related to the book can be found at its website: 


http://analysisofbooleanfunctions.org 


This includes complete lecture notes from the author’s 2007 course, complete 
lecture videos from the author’s 2012 course, blog updates related to analysis 
of Boolean functions, an electronic draft of the book, and errata. The author 
would like to encourage readers to post any typos, bugs, clarification requests, 
and suggestions to this website. 


Boolean Functions and the Fourier Expansion


In this chapter we describe the basics of analysis of Boolean functions. We 
emphasize viewing the Fourier expansion of a Boolean function as its repre- 
sentation as a real multilinear polynomial. The viewpoint based on harmonic 
analysis over $\mathbb{F}_2^n$ is mostly deferred to Chapter 3. We illustrate the use of basic
Fourier formulas through the analysis of the Blum-Luby-Rubinfeld linearity
test. 


1.1. On Analysis of Boolean Functions 
This is a book about Boolean functions,

$$f : \{0,1\}^n \to \{0,1\}.$$

Here $f$ maps each length-$n$ binary vector, or string, into a single binary value,
or bit. Boolean functions arise in many areas of computer science and
mathematics. Here are some examples:


• In circuit design, a Boolean function may represent the desired behavior of
  a circuit with $n$ inputs and one output.

• In graph theory, one can identify $v$-vertex graphs $G$ with length-$\binom{v}{2}$
  strings indicating which edges are present. Then $f$ may represent a property
  of such graphs; e.g., $f(G) = 1$ if and only if $G$ is connected.

• In extremal combinatorics, a Boolean function $f$ can be identified with a
  “set system” $\mathcal{F}$ on $[n] = \{1, 2, \dots, n\}$, where sets $X \subseteq [n]$ are
  identified with their 0-1 indicators and $X \in \mathcal{F}$ if and only if $f(X) = 1$.

• In coding theory, a Boolean function might be the indicator function for the
  set of messages in a binary error-correcting code of length $n$.

• In learning theory, a Boolean function may represent a “concept” with $n$
  binary attributes.

• In social choice theory, a Boolean function can be identified with a “voting
  rule” for an election with two candidates named 0 and 1.


We will be quite flexible about how bits are represented. Sometimes we
will use True and False; sometimes we will use $-1$ and $1$, thought of as real
numbers. Other times we will use $0$ and $1$, and these might be thought of as
real numbers, as elements of the field $\mathbb{F}_2$ of size 2, or just as symbols. Most
frequently we will use $-1$ and $1$, so a Boolean function will look like

$$f : \{-1,1\}^n \to \{-1,1\}.$$


But we won’t be dogmatic about the issue. 

We refer to the domain of a Boolean function, $\{-1,1\}^n$, as the Hamming cube
(or hypercube, $n$-cube, Boolean cube, or discrete cube). The name “Hamming
cube” emphasizes that we are often interested in the Hamming distance between
strings $x, y \in \{-1,1\}^n$, defined by

$$\Delta(x, y) = \#\{i : x_i \neq y_i\}.$$

Here we’ve used notation that will arise constantly: $x$ denotes a bit string, and
$x_i$ denotes its $i$th coordinate.

Suppose we have a problem involving Boolean functions with the following 
two characteristics: 


• the Hamming distance is relevant;
• you are counting strings, or the uniform probability distribution on $\{-1,1\}^n$
  is involved.


These are the hallmarks of a problem for which analysis of Boolean functions 
may help. Roughly speaking, this means deriving information about Boolean 
functions by analyzing their Fourier expansion. 


1.2. The “Fourier Expansion”: Functions as Multilinear Polynomials 


The Fourier expansion of a Boolean function $f : \{-1,1\}^n \to \{-1,1\}$ is simply
its representation as a real, multilinear polynomial. (Multilinear means that no
variable $x_i$ appears squared, cubed, etc.) For example, suppose $n = 2$ and
$f = \max_2$, the “maximum” function on 2 bits:

$$\max_2(+1,+1) = +1,$$
$$\max_2(-1,+1) = +1,$$
$$\max_2(+1,-1) = +1,$$
$$\max_2(-1,-1) = -1.$$

Then $\max_2$ can be expressed as a multilinear polynomial,

$$\max_2(x_1, x_2) = \tfrac12 + \tfrac12 x_1 + \tfrac12 x_2 - \tfrac12 x_1 x_2; \tag{1.1}$$


this is the “Fourier expansion” of $\max_2$. As another example, consider the
majority function on 3 bits, $\mathrm{Maj}_3 : \{-1,1\}^3 \to \{-1,1\}$, which outputs the $\pm 1$
bit occurring more frequently in its input. Then it’s easy to verify the Fourier
expansion

$$\mathrm{Maj}_3(x_1, x_2, x_3) = \tfrac12 x_1 + \tfrac12 x_2 + \tfrac12 x_3 - \tfrac12 x_1 x_2 x_3. \tag{1.2}$$

The functions $\max_2$ and $\mathrm{Maj}_3$ will serve as running examples in this chapter.
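
The two expansions (1.1) and (1.2) can be checked by brute force over all inputs. The following is only an illustrative sketch; the function names (`max2_poly`, `maj3_poly`, `agrees`) are hypothetical and not from the text.

```python
from itertools import product

def max2(x1, x2):
    return max(x1, x2)

def maj3(x1, x2, x3):
    # The +-1 value occurring more often is the sign of the sum.
    return 1 if x1 + x2 + x3 > 0 else -1

def max2_poly(x1, x2):
    # Right-hand side of (1.1).
    return 0.5 + 0.5 * x1 + 0.5 * x2 - 0.5 * x1 * x2

def maj3_poly(x1, x2, x3):
    # Right-hand side of (1.2).
    return 0.5 * x1 + 0.5 * x2 + 0.5 * x3 - 0.5 * x1 * x2 * x3

def agrees(f, poly, n):
    """Check that f and poly coincide on every point of {-1, 1}^n."""
    return all(f(*x) == poly(*x) for x in product([1, -1], repeat=n))

print(agrees(max2, max2_poly, 2), agrees(maj3, maj3_poly, 3))
```

All coefficients are multiples of 1/2, so the floating-point comparison here is exact.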

Let’s see how to obtain such multilinear polynomial representations in
general. Given an arbitrary Boolean function $f : \{-1,1\}^n \to \{-1,1\}$ there
is a familiar method for finding a polynomial that interpolates the $2^n$ values
that $f$ assigns to the points $\{-1,1\}^n \subset \mathbb{R}^n$. For each point $a = (a_1, \dots, a_n) \in
\{-1,1\}^n$ the indicator polynomial

$$1_{\{a\}}(x) = \left(\frac{1 + a_1 x_1}{2}\right)\left(\frac{1 + a_2 x_2}{2}\right) \cdots \left(\frac{1 + a_n x_n}{2}\right)$$

takes value 1 when $x = a$ and value 0 when $x \in \{-1,1\}^n \setminus \{a\}$. Thus $f$ has the
polynomial representation

$$f(x) = \sum_{a \in \{-1,1\}^n} f(a)\, 1_{\{a\}}(x).$$
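Illustrating with the $f = \max_2$ example again, we have

$$\begin{aligned}
\max_2(x) ={}& (+1)\left(\tfrac{1+x_1}{2}\right)\left(\tfrac{1+x_2}{2}\right) \\
&+ (+1)\left(\tfrac{1-x_1}{2}\right)\left(\tfrac{1+x_2}{2}\right) \\
&+ (+1)\left(\tfrac{1+x_1}{2}\right)\left(\tfrac{1-x_2}{2}\right) \\
&+ (-1)\left(\tfrac{1-x_1}{2}\right)\left(\tfrac{1-x_2}{2}\right)
 = \tfrac12 + \tfrac12 x_1 + \tfrac12 x_2 - \tfrac12 x_1 x_2.
\end{aligned} \tag{1.3}$$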


Let us make two remarks about this interpolation procedure. First, it works
equally well in the more general case of real-valued Boolean functions,
$f : \{-1,1\}^n \to \mathbb{R}$. Second, since the indicator polynomials are multilinear
when expanded out, the interpolation always produces a multilinear polynomial.
Indeed, it makes sense that we can represent functions $f : \{-1,1\}^n \to \mathbb{R}$ with
multilinear polynomials: since we only care about inputs $x$ where $x_i = \pm 1$, any
factor of $x_i^2$ can be replaced by 1.
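
The interpolation procedure can be sketched in code by expanding each indicator polynomial symbolically and summing. This is a minimal illustration, not an efficient method; the representation of a polynomial as a dictionary from monomials (frozensets of indices) to coefficients, and the names `indicator_poly` and `interpolate`, are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

def indicator_poly(a):
    """Expand prod_i (1 + a_i x_i)/2 into {monomial S: coefficient}."""
    poly = {frozenset(): Fraction(1)}
    for i, a_i in enumerate(a, start=1):
        expanded = {}
        for S, c in poly.items():
            expanded[S] = c / 2               # take the constant 1/2 from factor i
            expanded[S | {i}] = c * a_i / 2   # take (a_i / 2) * x_i from factor i
        poly = expanded
    return poly

def interpolate(f, n):
    """Sum f(a) * indicator_poly(a) over all points a of {-1, 1}^n."""
    total = {}
    for a in product([1, -1], repeat=n):
        for S, c in indicator_poly(a).items():
            total[S] = total.get(S, Fraction(0)) + f(a) * c
    return total

# f(a) = max(a) is the max_2 function when n = 2.
coeffs = interpolate(max, 2)
for S in sorted(coeffs, key=lambda S: (len(S), sorted(S))):
    print(sorted(S), coeffs[S])
```

Because each factor involves a distinct variable, no squared variable ever appears, which is the multilinearity remark made above.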

We have illustrated that every $f : \{-1,1\}^n \to \mathbb{R}$ can be represented by a
real multilinear polynomial; as we will see in Section 1.3, this representation
is unique. The multilinear polynomial for $f$ may have up to $2^n$ terms,
corresponding to the subsets $S \subseteq [n]$. We write the monomial corresponding to $S$
as

$$x^S = \prod_{i \in S} x_i \quad \text{(with $x^\emptyset = 1$ by convention)},$$

and we use the following notation for its coefficient:

$$\widehat{f}(S) = \text{coefficient on monomial $x^S$ in the multilinear representation of $f$}.$$


This discussion is summarized by the Fourier expansion theorem: 


Theorem 1.1. Every function $f : \{-1,1\}^n \to \mathbb{R}$ can be uniquely expressed as
a multilinear polynomial,

$$f(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, x^S. \tag{1.4}$$

This expression is called the Fourier expansion of $f$, and the real number $\widehat{f}(S)$
is called the Fourier coefficient of $f$ on $S$. Collectively, the coefficients are
called the Fourier spectrum of $f$.


As examples, from (1.1) and (1.2) we obtain: 


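$$\widehat{\max_2}(\emptyset) = \tfrac12, \quad \widehat{\max_2}(\{1\}) = \tfrac12, \quad \widehat{\max_2}(\{2\}) = \tfrac12, \quad \widehat{\max_2}(\{1,2\}) = -\tfrac12;$$

$$\widehat{\mathrm{Maj}_3}(\{1\}),\ \widehat{\mathrm{Maj}_3}(\{2\}),\ \widehat{\mathrm{Maj}_3}(\{3\}) = \tfrac12, \quad \widehat{\mathrm{Maj}_3}(\{1,2,3\}) = -\tfrac12, \quad \widehat{\mathrm{Maj}_3}(S) = 0 \text{ else}.$$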


We finish this section with some notation. It is convenient to think of the
monomial $x^S$ as a function on $x = (x_1, \dots, x_n) \in \mathbb{R}^n$; we write it as

$$\chi_S(x) = \prod_{i \in S} x_i.$$

Thus we sometimes write the Fourier expansion of $f : \{-1,1\}^n \to \mathbb{R}$ as

$$f(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, \chi_S(x).$$




So far our notation makes sense only when representing the Hamming cube by
$\{-1,1\}^n \subseteq \mathbb{R}^n$. The other frequent representation we will use for the cube is $\mathbb{F}_2^n$.
We can define the Fourier expansion for functions $f : \mathbb{F}_2^n \to \mathbb{R}$ by “encoding”
input bits $0, 1 \in \mathbb{F}_2$ by the real numbers $-1, 1 \in \mathbb{R}$. We choose the encoding
$\chi : \mathbb{F}_2 \to \mathbb{R}$ defined by

$$\chi(0_{\mathbb{F}_2}) = +1, \qquad \chi(1_{\mathbb{F}_2}) = -1.$$

This encoding is not so natural from the perspective of Boolean logic; e.g., it
means the function $\max_2$ we have discussed represents logical AND. But it’s
mathematically natural because for $b \in \mathbb{F}_2$ we have the formula $\chi(b) = (-1)^b$.
We now extend the $\chi_S$ notation:


Definition 1.2. For $S \subseteq [n]$ we define $\chi_S : \mathbb{F}_2^n \to \mathbb{R}$ by

$$\chi_S(x) = \prod_{i \in S} \chi(x_i) = (-1)^{\sum_{i \in S} x_i},$$

which satisfies

$$\chi_S(x + y) = \chi_S(x)\chi_S(y). \tag{1.5}$$


In this way, given any function $f : \mathbb{F}_2^n \to \mathbb{R}$ it makes sense to write its
Fourier expansion as

$$f(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, \chi_S(x).$$

In fact, if we are really thinking of $\mathbb{F}_2^n$ as the $n$-dimensional vector space over
$\mathbb{F}_2$, it makes sense to identify subsets $S \subseteq [n]$ with vectors $\gamma \in \mathbb{F}_2^n$. This will
be discussed in Chapter 3.2.


1.3. The Orthonormal Basis of Parity Functions 


For $x \in \{-1,1\}^n$, the number $\chi_S(x) = \prod_{i \in S} x_i$ is in $\{-1,1\}$. Thus $\chi_S :
\{-1,1\}^n \to \{-1,1\}$ is a Boolean function; it computes the logical parity, or
exclusive-or (XOR), of the bits $(x_i)_{i \in S}$. The parity functions play a special role
in the analysis of Boolean functions: the Fourier expansion

$$f = \sum_{S \subseteq [n]} \widehat{f}(S)\, \chi_S \tag{1.6}$$

shows that any $f$ can be represented as a linear combination of parity functions
(over the reals).
It’s useful to explore this idea further from the perspective of linear algebra.
The set of all functions $f : \{-1,1\}^n \to \mathbb{R}$ forms a vector space $V$, since we
can add two functions (pointwise) and we can multiply a function by a real
scalar. The vector space $V$ is $2^n$-dimensional: if we like we can think of the
functions in this vector space as vectors in $\mathbb{R}^{2^n}$, where we stack the $2^n$ values
$f(x)$ into a tall column vector (in some fixed order). Here we illustrate the
Fourier expansion (1.1) of the $\max_2$ function from this perspective:

$$\max_2 = \begin{bmatrix} +1 \\ +1 \\ +1 \\ -1 \end{bmatrix}
= \tfrac12 \begin{bmatrix} +1 \\ +1 \\ +1 \\ +1 \end{bmatrix}
+ \tfrac12 \begin{bmatrix} +1 \\ -1 \\ +1 \\ -1 \end{bmatrix}
+ \tfrac12 \begin{bmatrix} +1 \\ +1 \\ -1 \\ -1 \end{bmatrix}
+ \left(-\tfrac12\right) \begin{bmatrix} +1 \\ -1 \\ -1 \\ +1 \end{bmatrix}. \tag{1.7}$$


More generally, the Fourier expansion (1.6) shows that every function
$f : \{-1,1\}^n \to \mathbb{R}$ in $V$ is a linear combination of the parity functions; i.e.,
the parity functions are a spanning set for $V$. Since the number of parity
functions is $2^n = \dim V$, we can deduce that they are in fact a linearly independent
basis for $V$. In particular this justifies the uniqueness of the Fourier expansion
stated in Theorem 1.1.

We can also introduce an inner product on pairs of functions
$f, g : \{-1,1\}^n \to \mathbb{R}$ in $V$. The usual inner product on $\mathbb{R}^{2^n}$ would correspond
to $\sum_{x \in \{-1,1\}^n} f(x)g(x)$, but it’s more convenient to scale this by a factor of
$2^{-n}$, making it an average rather than a sum. In this way, a Boolean function
$f : \{-1,1\}^n \to \{-1,1\}$ will have $\langle f, f \rangle = 1$, i.e., be a “unit vector”.


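Definition 1.3. We define an inner product $\langle \cdot, \cdot \rangle$ on pairs of functions
$f, g : \{-1,1\}^n \to \mathbb{R}$ by

$$\langle f, g \rangle = 2^{-n} \sum_{x \in \{-1,1\}^n} f(x)g(x) = \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \{-1,1\}^n}[f(\boldsymbol{x})g(\boldsymbol{x})]. \tag{1.8}$$

We also use the notation $\|f\|_2 = \sqrt{\langle f, f \rangle}$, and more generally,

$$\|f\|_p = \mathbf{E}[|f(\boldsymbol{x})|^p]^{1/p}.$$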


Here we have introduced probabilistic notation that will be used heavily 
throughout the book: 
Notation 1.4. We write $\boldsymbol{x} \sim \{-1,1\}^n$ to denote that $\boldsymbol{x}$ is a uniformly chosen
random string from $\{-1,1\}^n$. Equivalently, the $n$ coordinates $\boldsymbol{x}_i$ are
independently chosen to be $+1$ with probability 1/2 and $-1$ with probability 1/2. We
always write random variables in boldface. Probabilities $\mathbf{Pr}$ and expectations
$\mathbf{E}$ will always be with respect to a uniformly random $\boldsymbol{x} \sim \{-1,1\}^n$ unless
otherwise specified. Thus we might write the expectation in (1.8) as $\mathbf{E}_{\boldsymbol{x}}[f(\boldsymbol{x})g(\boldsymbol{x})]$
or $\mathbf{E}[f(\boldsymbol{x})g(\boldsymbol{x})]$ or even $\mathbf{E}[fg]$.


Returning to the basis of parity functions for V, the crucial fact underlying 
all analysis of Boolean functions is that this is an orthonormal basis. 


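Theorem 1.5. The $2^n$ parity functions $\chi_S : \{-1,1\}^n \to \{-1,1\}$ form an
orthonormal basis for the vector space $V$ of functions $\{-1,1\}^n \to \mathbb{R}$; i.e.,

$$\langle \chi_S, \chi_T \rangle = \begin{cases} 1 & \text{if } S = T, \\ 0 & \text{if } S \neq T. \end{cases}$$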

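Recalling the definition $\langle \chi_S, \chi_T \rangle = \mathbf{E}[\chi_S(\boldsymbol{x})\chi_T(\boldsymbol{x})]$, Theorem 1.5 follows
immediately from two facts:

Fact 1.6. For $x \in \{-1,1\}^n$ it holds that $\chi_S(x)\chi_T(x) = \chi_{S \triangle T}(x)$, where $S \triangle T$
denotes symmetric difference.

Proof. $$\chi_S(x)\chi_T(x) = \prod_{i \in S} x_i \prod_{i \in T} x_i = \prod_{i \in S \triangle T} x_i \prod_{i \in S \cap T} x_i^2 = \prod_{i \in S \triangle T} x_i = \chi_{S \triangle T}(x).$$

Fact 1.7. $$\mathbf{E}[\chi_S(\boldsymbol{x})] = \mathbf{E}\Bigl[\prod_{i \in S} \boldsymbol{x}_i\Bigr] = \begin{cases} 1 & \text{if } S = \emptyset, \\ 0 & \text{if } S \neq \emptyset. \end{cases}$$

Proof. If $S = \emptyset$ then $\mathbf{E}[\chi_S(\boldsymbol{x})] = \mathbf{E}[1] = 1$. Otherwise,

$$\mathbf{E}\Bigl[\prod_{i \in S} \boldsymbol{x}_i\Bigr] = \prod_{i \in S} \mathbf{E}[\boldsymbol{x}_i]$$

because the random bits $\boldsymbol{x}_1, \dots, \boldsymbol{x}_n$ are independent. But each of the factors
$\mathbf{E}[\boldsymbol{x}_i]$ in the above (nonempty) product is $(1/2)(+1) + (1/2)(-1) = 0$.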

1.4. Basic Fourier Formulas 


As we have seen, the Fourier expansion of $f : \{-1,1\}^n \to \mathbb{R}$ can be thought
of as the representation of $f$ over the orthonormal basis of parity functions
$(\chi_S)_{S \subseteq [n]}$. In this basis, $f$ has $2^n$ “coordinates”, and these are precisely the
Fourier coefficients of $f$. The “coordinate” of $f$ in the $\chi_S$ “direction” is $\langle f, \chi_S \rangle$;
i.e., we have the following formula for Fourier coefficients:


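Proposition 1.8. For $f : \{-1,1\}^n \to \mathbb{R}$ and $S \subseteq [n]$, the Fourier coefficient
of $f$ on $S$ is given by

$$\widehat{f}(S) = \langle f, \chi_S \rangle = \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \{-1,1\}^n}[f(\boldsymbol{x})\chi_S(\boldsymbol{x})].$$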


We can verify this formula explicitly:

$$\langle f, \chi_S \rangle = \Bigl\langle \sum_{T \subseteq [n]} \widehat{f}(T)\chi_T,\ \chi_S \Bigr\rangle = \sum_{T \subseteq [n]} \widehat{f}(T) \langle \chi_T, \chi_S \rangle = \widehat{f}(S), \tag{1.9}$$

where we used the Fourier expansion of $f$, the linearity of $\langle \cdot, \cdot \rangle$, and finally
Theorem 1.5. This formula is the simplest way to calculate the Fourier
coefficients of a given function; it can also be viewed as a streamlined version of
the interpolation method illustrated in (1.3). Alternatively, this formula can be
taken as the definition of Fourier coefficients.
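
As a small illustration of Proposition 1.8, the sketch below computes every Fourier coefficient of a function by averaging $f(x)\chi_S(x)$ over the cube. The name `fourier_coefficients` and the use of sorted tuples for subsets are assumptions of the example; the cost is on the order of $4^n$ operations, so it is only suitable for small $n$.

```python
from fractions import Fraction
from itertools import combinations, product

def fourier_coefficients(f, n):
    """Map each subset S of {1..n} (a sorted tuple) to E[f(x) chi_S(x)]."""
    cube = list(product([1, -1], repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(1, n + 1), k):
            total = 0
            for x in cube:
                chi = 1
                for i in S:
                    chi *= x[i - 1]
                total += f(x) * chi
            coeffs[S] = Fraction(total, len(cube))
    return coeffs

maj3 = lambda x: 1 if sum(x) > 0 else -1   # majority on 3 bits
maj3_hat = fourier_coefficients(maj3, 3)
for S, c in maj3_hat.items():
    print(S, c)
```

For integer-valued $f$ the use of `Fraction` keeps the coefficients exact, so they can be compared directly with (1.2).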

The orthonormal basis of parities also lets us measure the squared “length”
(2-norm) of $f : \{-1,1\}^n \to \mathbb{R}$ efficiently: it’s just the sum of the squares of
$f$’s “coordinates”, i.e., Fourier coefficients. This simple but crucial fact is
called Parseval’s Theorem.


Parseval’s Theorem. For any $f : \{-1,1\}^n \to \mathbb{R}$,

$$\langle f, f \rangle = \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \{-1,1\}^n}[f(\boldsymbol{x})^2] = \sum_{S \subseteq [n]} \widehat{f}(S)^2.$$

In particular, if $f : \{-1,1\}^n \to \{-1,1\}$ is Boolean-valued then

$$\sum_{S \subseteq [n]} \widehat{f}(S)^2 = 1.$$

As examples we can recall the Fourier expansions of $\max_2$ and $\mathrm{Maj}_3$:

$$\max_2(x) = \tfrac12 + \tfrac12 x_1 + \tfrac12 x_2 - \tfrac12 x_1 x_2, \qquad \mathrm{Maj}_3(x) = \tfrac12 x_1 + \tfrac12 x_2 + \tfrac12 x_3 - \tfrac12 x_1 x_2 x_3.$$

In both cases the sum of squares of Fourier coefficients is $4 \times (1/4) = 1$.

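More generally, given two functions $f, g : \{-1,1\}^n \to \mathbb{R}$, we can compute
$\langle f, g \rangle$ by taking the “dot product” of their coordinates in the orthonormal basis
of parities. The resulting formula is called Plancherel’s Theorem.


Plancherel’s Theorem. For any $f, g : \{-1,1\}^n \to \mathbb{R}$,

$$\langle f, g \rangle = \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \{-1,1\}^n}[f(\boldsymbol{x})g(\boldsymbol{x})] = \sum_{S \subseteq [n]} \widehat{f}(S)\widehat{g}(S).$$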
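We can verify this formula explicitly as we did in (1.9):

$$\langle f, g \rangle = \Bigl\langle \sum_{S \subseteq [n]} \widehat{f}(S)\chi_S,\ \sum_{T \subseteq [n]} \widehat{g}(T)\chi_T \Bigr\rangle = \sum_{S, T \subseteq [n]} \widehat{f}(S)\widehat{g}(T) \langle \chi_S, \chi_T \rangle = \sum_{S \subseteq [n]} \widehat{f}(S)\widehat{g}(S).$$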
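
Parseval’s and Plancherel’s Theorems are easy to sanity-check on small examples. The following sketch compares both sides exactly for $\mathrm{Maj}_3$ and the 3-bit maximum function; the second function and the helper names `fourier` and `inner` are choices made for this illustration only.

```python
from fractions import Fraction
from itertools import combinations, product

def fourier(f, n):
    """All Fourier coefficients of f, keyed by subsets of {1..n} as tuples."""
    cube = list(product([1, -1], repeat=n))
    out = {}
    for k in range(n + 1):
        for S in combinations(range(1, n + 1), k):
            total = 0
            for x in cube:
                chi = 1
                for i in S:
                    chi *= x[i - 1]
                total += f(x) * chi
            out[S] = Fraction(total, len(cube))
    return out

def inner(f, g, n):
    """<f, g> as an exact average over the cube."""
    cube = list(product([1, -1], repeat=n))
    return Fraction(sum(f(x) * g(x) for x in cube), len(cube))

n = 3
f = lambda x: 1 if sum(x) > 0 else -1   # Maj_3
g = lambda x: max(x)                    # maximum on 3 bits
f_hat, g_hat = fourier(f, n), fourier(g, n)

parseval_sum = sum(c * c for c in f_hat.values())
plancherel_sum = sum(f_hat[S] * g_hat[S] for S in f_hat)
print(parseval_sum, inner(f, f, n))
print(plancherel_sum, inner(f, g, n))
```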


Now is a good time to remark that for Boolean-valued functions
$f, g : \{-1,1\}^n \to \{-1,1\}$, the inner product $\langle f, g \rangle$ can be interpreted as a
kind of “correlation” between $f$ and $g$, measuring how similar they are. Since
$f(x)g(x) = 1$ if $f(x) = g(x)$ and $f(x)g(x) = -1$ if $f(x) \neq g(x)$, we have:


Proposition 1.9. If $f, g : \{-1,1\}^n \to \{-1,1\}$, then

$$\langle f, g \rangle = \mathbf{Pr}[f(\boldsymbol{x}) = g(\boldsymbol{x})] - \mathbf{Pr}[f(\boldsymbol{x}) \neq g(\boldsymbol{x})] = 1 - 2\,\mathrm{dist}(f, g).$$


Here we are using the following definition: 


Definition 1.10. Given $f, g : \{-1,1\}^n \to \{-1,1\}$, we define their (relative
Hamming) distance to be

$$\mathrm{dist}(f, g) = \mathop{\mathbf{Pr}}_{\boldsymbol{x}}[f(\boldsymbol{x}) \neq g(\boldsymbol{x})],$$

the fraction of inputs on which they disagree.


With a number of Fourier formulas now in hand we can begin to illustrate 
a basic theme in the analysis of Boolean functions: interesting combinatorial 
properties of a Boolean function f can be “read off” from its Fourier coeffi- 
cients. Let’s start by looking at one way to measure the “bias” of f: 


Definition 1.11. The mean of $f : \{-1,1\}^n \to \mathbb{R}$ is $\mathbf{E}[f]$. When $f$ has
mean 0 we say that it is unbiased, or balanced. In the particular case that
$f : \{-1,1\}^n \to \{-1,1\}$ is Boolean-valued, its mean is

$$\mathbf{E}[f] = \mathbf{Pr}[f = 1] - \mathbf{Pr}[f = -1];$$

thus $f$ is unbiased if and only if it takes value 1 on exactly half of the points of
the Hamming cube.


Fact 1.12. If $f : \{-1,1\}^n \to \mathbb{R}$ then $\mathbf{E}[f] = \widehat{f}(\emptyset)$.


This formula holds simply because $\mathbf{E}[f] = \langle f, 1 \rangle = \widehat{f}(\emptyset)$ (taking $S = \emptyset$ in
Proposition 1.8). In particular, a Boolean function is unbiased if and only if its
empty-set Fourier coefficient is 0.

Next we obtain a formula for the variance of a real-valued Boolean function
(thinking of $f(\boldsymbol{x})$ as a real-valued random variable):

Proposition 1.13. The variance of $f : \{-1,1\}^n \to \mathbb{R}$ is

$$\mathbf{Var}[f] = \langle f - \mathbf{E}[f],\ f - \mathbf{E}[f] \rangle = \mathbf{E}[f^2] - \mathbf{E}[f]^2 = \sum_{S \neq \emptyset} \widehat{f}(S)^2.$$

This Fourier formula follows immediately from Parseval’s Theorem and
Fact 1.12.


Fact 1.14. For $f : \{-1,1\}^n \to \{-1,1\}$,

$$\mathbf{Var}[f] = 1 - \mathbf{E}[f]^2 = 4\,\mathbf{Pr}[f(\boldsymbol{x}) = 1]\,\mathbf{Pr}[f(\boldsymbol{x}) = -1] \in [0, 1].$$

In particular, a Boolean-valued function $f$ has variance 1 if it’s unbiased and
variance 0 if it’s constant. More generally, the variance of a Boolean-valued
function is proportional to its “distance from being constant”.


Proposition 1.15. Let $f : \{-1,1\}^n \to \{-1,1\}$. Then $2\epsilon \le \mathbf{Var}[f] \le 4\epsilon$,
where

$$\epsilon = \min\{\mathrm{dist}(f, 1),\ \mathrm{dist}(f, -1)\}.$$


The proof of Proposition 1.15 is an exercise. See also Exercise 1.17. 
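
Without giving away the exercise, the inequality in Proposition 1.15 can at least be checked exhaustively for small $n$. The sketch below enumerates all $2^8$ Boolean-valued functions on 3 bits as truth tables; the helper names `variance` and `dist_to_constant` are hypothetical.

```python
from fractions import Fraction
from itertools import product

n = 3
num_points = 2 ** n

def variance(values):
    """Var[f] = E[f^2] - E[f]^2 for a truth table of +-1 values."""
    mean = Fraction(sum(values), len(values))
    second_moment = Fraction(sum(v * v for v in values), len(values))
    return second_moment - mean ** 2

def dist_to_constant(values):
    """min{dist(f, 1), dist(f, -1)} for a truth table of +-1 values."""
    frac_minus = Fraction(sum(1 for v in values if v == -1), len(values))
    return min(frac_minus, 1 - frac_minus)

# Every truth table of length 2^n is one Boolean-valued function on n bits.
bounds_hold = all(
    2 * dist_to_constant(t) <= variance(t) <= 4 * dist_to_constant(t)
    for t in product([1, -1], repeat=num_points)
)
print(bounds_hold)
```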
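By using Plancherel in place of Parseval, we get a generalization of
Proposition 1.13 for covariance:


Proposition 1.16. The covariance of $f, g : \{-1,1\}^n \to \mathbb{R}$ is

$$\mathbf{Cov}[f, g] = \langle f - \mathbf{E}[f],\ g - \mathbf{E}[g] \rangle = \mathbf{E}[fg] - \mathbf{E}[f]\,\mathbf{E}[g] = \sum_{S \neq \emptyset} \widehat{f}(S)\widehat{g}(S).$$
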
We end this section by discussing the Fourier weight distribution of Boolean 
functions. 


Definition 1.17. The (Fourier) weight of $f : \{-1,1\}^n \to \mathbb{R}$ on set $S$ is defined
to be the squared Fourier coefficient, $\widehat{f}(S)^2$.


Although we lose some information about the Fourier coefficients when
we square them, many Fourier formulas only depend on the weights of $f$.
For example, Proposition 1.13 says that the variance of $f$ equals its Fourier
weight on nonempty sets. Studying Fourier weights is particularly pleasant for
Boolean-valued functions $f : \{-1,1\}^n \to \{-1,1\}$ since Parseval’s Theorem
says that they always have total weight 1. In particular, they define a probability
distribution on subsets of $[n]$.


Definition 1.18. Given $f : \{-1,1\}^n \to \{-1,1\}$, the spectral sample for $f$,
denoted $\mathscr{S}_f$, is the probability distribution on subsets of $[n]$ in which the set $S$
has probability $\widehat{f}(S)^2$. We write $\boldsymbol{S} \sim \mathscr{S}_f$ for a draw from this distribution.
Figure 1.1. Fourier weight distribution of the $\mathrm{Maj}_3$ function


For example, the spectral sample for the $\max_2$ function is the uniform
distribution on all four subsets of $[2]$; the spectral sample for $\mathrm{Maj}_3$ is the
uniform distribution on the four subsets of $[3]$ with odd cardinality.

Given a Boolean function it can be helpful to try to keep a mental picture
of its weight distribution on the subsets of $[n]$, partially ordered by inclusion.
Figure 1.1 is an example for the $\mathrm{Maj}_3$ function, with the white circles indicating
weight 0 and the shaded circles indicating weight 1/4.

Finally, as suggested by the diagram we often stratify the subsets $S \subseteq [n]$
according to their cardinality (also called “height” or “level”). Equivalently,
this is the degree of the associated monomial $x^S$.


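Definition 1.19. For $f : \{-1,1\}^n \to \mathbb{R}$ and $0 \le k \le n$, the (Fourier) weight
of $f$ at degree $k$ is

$$\mathbf{W}^{k}[f] = \sum_{\substack{S \subseteq [n] \\ |S| = k}} \widehat{f}(S)^2.$$

If $f : \{-1,1\}^n \to \{-1,1\}$ is Boolean-valued, an equivalent definition is

$$\mathbf{W}^{k}[f] = \mathop{\mathbf{Pr}}_{\boldsymbol{S} \sim \mathscr{S}_f}[|\boldsymbol{S}| = k].$$

By Parseval’s Theorem, $\mathbf{W}^{k}[f] = \|f^{=k}\|_2^2$ where

$$f^{=k} = \sum_{|S| = k} \widehat{f}(S)\, \chi_S$$

is called the degree $k$ part of $f$. We will also sometimes use notation like
$\mathbf{W}^{>k}[f] = \sum_{|S| > k} \widehat{f}(S)^2$ and $f^{\le k} = \sum_{|S| \le k} \widehat{f}(S)\, \chi_S$.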
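
To make Definition 1.19 concrete, here is a small sketch that groups the Fourier weight of a function by degree. The name `degree_weights` is hypothetical, and subsets are indexed from 0 purely for programming convenience.

```python
from fractions import Fraction
from itertools import combinations, product

def degree_weights(f, n):
    """Return [W^0[f], W^1[f], ..., W^n[f]] as exact fractions."""
    cube = list(product([1, -1], repeat=n))
    weights = []
    for k in range(n + 1):
        w = Fraction(0)
        for S in combinations(range(n), k):   # 0-indexed coordinates
            total = 0
            for x in cube:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            w += Fraction(total, len(cube)) ** 2   # squared coefficient on S
        weights.append(w)
    return weights

maj3 = lambda x: 1 if sum(x) > 0 else -1
maj3_weights = degree_weights(maj3, 3)
print(maj3_weights)
```

For $\mathrm{Maj}_3$ the weight sits entirely on degrees 1 and 3, matching the description of its spectral sample, and the weights sum to 1 as Parseval’s Theorem requires for a Boolean-valued function.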
1.5. Probability Densities and Convolution


For variety’s sake, in this section we write the Hamming cube as $\mathbb{F}_2^n$ rather
than $\{-1,1\}^n$. In developing the Fourier expansion, we have generalized from
Boolean-valued Boolean functions $f : \mathbb{F}_2^n \to \{-1,1\}$ to real-valued Boolean
functions $f : \mathbb{F}_2^n \to \mathbb{R}$. Boolean-valued functions arise more often in
combinatorial problems, but there are important classes of real-valued Boolean
functions. One example is probability densities.


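Definition 1.20. A (probability) density function on the Hamming cube $\mathbb{F}_2^n$ is
any nonnegative function $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ satisfying

$$\mathop{\mathbf{E}}_{\boldsymbol{x} \sim \mathbb{F}_2^n}[\varphi(\boldsymbol{x})] = 1.$$

We write $\boldsymbol{y} \sim \varphi$ to denote that $\boldsymbol{y}$ is a random string drawn from the associated
probability distribution, defined by

$$\mathop{\mathbf{Pr}}_{\boldsymbol{y} \sim \varphi}[\boldsymbol{y} = y] = \varphi(y)\,\frac{1}{2^n} \quad \forall y \in \mathbb{F}_2^n.$$

Here you should think of $\varphi(y)$ as being the relative density of $y$ with respect
to the uniform distribution on $\mathbb{F}_2^n$. For example, we have: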


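Fact 1.21. If $\varphi$ is a density function and $g : \mathbb{F}_2^n \to \mathbb{R}$, then

$$\mathop{\mathbf{E}}_{\boldsymbol{y} \sim \varphi}[g(\boldsymbol{y})] = \langle \varphi, g \rangle = \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \mathbb{F}_2^n}[\varphi(\boldsymbol{x})g(\boldsymbol{x})].$$


The simplest example of a probability density is just the constant function 1,
which corresponds to the uniform probability distribution on $\mathbb{F}_2^n$. The most
common case arises from the uniform distribution over some subset $A \subseteq \mathbb{F}_2^n$.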


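Definition 1.22. If $A \subseteq \mathbb{F}_2^n$ we write $1_A : \mathbb{F}_2^n \to \{0,1\}$ for the 0-1 indicator
function of $A$; i.e.,

$$1_A(x) = \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}$$

Assuming $A \neq \emptyset$ we write $\varphi_A$ for the density function associated to the uniform
distribution on $A$; i.e.,

$$\varphi_A = \frac{1}{\mathbf{E}[1_A]}\, 1_A.$$

We typically write $\boldsymbol{y} \sim A$ rather than $\boldsymbol{y} \sim \varphi_A$.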


A simple but useful example is when $A$ is the singleton set $A = \{0\}$. (Here $0$
is denoting the vector $(0, 0, \dots, 0) \in \mathbb{F}_2^n$.) In this case the function $\varphi_{\{0\}}$ takes
value $2^n$ on input $0 \in \mathbb{F}_2^n$ and is zero elsewhere on $\mathbb{F}_2^n$. In Exercise 1.1 you will
verify the Fourier expansion of $\varphi_{\{0\}}$:


Fact 1.23. Every Fourier coefficient of $\varphi_{\{0\}}$ is 1; i.e., its Fourier expansion is

$$\varphi_{\{0\}}(y) = \sum_{S \subseteq [n]} \chi_S(y).$$


We now introduce an operation on functions that interacts particularly nicely 
with density functions, namely, convolution. 


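Definition 1.24. Let $f, g : \mathbb{F}_2^n \to \mathbb{R}$. Their convolution is the function $f * g :
\mathbb{F}_2^n \to \mathbb{R}$ defined by

$$(f * g)(x) = \mathop{\mathbf{E}}_{\boldsymbol{y} \sim \mathbb{F}_2^n}[f(\boldsymbol{y})g(x - \boldsymbol{y})] = \mathop{\mathbf{E}}_{\boldsymbol{y} \sim \mathbb{F}_2^n}[f(x - \boldsymbol{y})g(\boldsymbol{y})].$$

Since subtraction is equivalent to addition in $\mathbb{F}_2^n$ we may also write

$$(f * g)(x) = \mathop{\mathbf{E}}_{\boldsymbol{y} \sim \mathbb{F}_2^n}[f(\boldsymbol{y})g(x + \boldsymbol{y})] = \mathop{\mathbf{E}}_{\boldsymbol{y} \sim \mathbb{F}_2^n}[f(x + \boldsymbol{y})g(\boldsymbol{y})].$$

If we were representing the Hamming cube by $\{-1,1\}^n$ rather than $\mathbb{F}_2^n$ we
would replace $x + y$ with $x \circ y$, where $\circ$ denotes entry-wise multiplication.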


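Exercise 1.25 asks you to verify that convolution is associative and
commutative:

$$f * (g * h) = (f * g) * h, \qquad f * g = g * f.$$

Using Fact 1.21 we can deduce the following two simple results:

Proposition 1.25. If $\varphi$ is a density function on $\mathbb{F}_2^n$ and $g : \mathbb{F}_2^n \to \mathbb{R}$ then

$$\varphi * g(x) = \mathop{\mathbf{E}}_{\boldsymbol{y} \sim \varphi}[g(x - \boldsymbol{y})] = \mathop{\mathbf{E}}_{\boldsymbol{y} \sim \varphi}[g(x + \boldsymbol{y})].$$

In particular, $\mathbf{E}_{\boldsymbol{y} \sim \varphi}[g(\boldsymbol{y})] = \varphi * g(0)$.

Proposition 1.26. If $g = \psi$ is itself a probability density function then so is
$\varphi * \psi$; it represents the distribution on $\boldsymbol{x} \in \mathbb{F}_2^n$ given by choosing $\boldsymbol{y} \sim \varphi$ and
$\boldsymbol{z} \sim \psi$ independently and setting $\boldsymbol{x} = \boldsymbol{y} + \boldsymbol{z}$.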


The most important theorem about convolution is that it corresponds to 
multiplication of Fourier coefficients: 


Theorem 1.27. Let $f, g : \mathbb{F}_2^n \to \mathbb{R}$. Then for all $S \subseteq [n]$,

$$\widehat{f * g}(S) = \widehat{f}(S)\,\widehat{g}(S).$$
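Proof. We have

$$\begin{aligned}
\widehat{f * g}(S) &= \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \mathbb{F}_2^n}[(f * g)(\boldsymbol{x})\chi_S(\boldsymbol{x})] && \text{(the Fourier formula)} \\
&= \mathop{\mathbf{E}}_{\boldsymbol{x} \sim \mathbb{F}_2^n}\Bigl[\mathop{\mathbf{E}}_{\boldsymbol{y} \sim \mathbb{F}_2^n}[f(\boldsymbol{y})g(\boldsymbol{x} - \boldsymbol{y})]\,\chi_S(\boldsymbol{x})\Bigr] && \text{(by definition)} \\
&= \mathop{\mathbf{E}}_{\substack{\boldsymbol{y}, \boldsymbol{z} \sim \mathbb{F}_2^n \\ \text{independently}}}[f(\boldsymbol{y})g(\boldsymbol{z})\chi_S(\boldsymbol{y} + \boldsymbol{z})] && \text{(as $\boldsymbol{x} - \boldsymbol{y}$ is uniform on $\mathbb{F}_2^n$ $\forall \boldsymbol{x}$)} \\
&= \mathop{\mathbf{E}}_{\boldsymbol{y}, \boldsymbol{z}}[f(\boldsymbol{y})\chi_S(\boldsymbol{y})\,g(\boldsymbol{z})\chi_S(\boldsymbol{z})] && \text{(by identity (1.5))} \\
&= \widehat{f}(S)\,\widehat{g}(S) && \text{(Fourier formula, independence)},
\end{aligned}$$

as claimed.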
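
Theorem 1.27 can also be checked numerically. In the sketch below, functions on $\mathbb{F}_2^n$ are stored as dictionaries from 0/1 tuples to values; the two test functions `f` and `g` are arbitrary choices for illustration, and the helper names are not from the text.

```python
from fractions import Fraction
from itertools import combinations, product

n = 3
cube = list(product([0, 1], repeat=n))
subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]

def add(x, y):
    """Coordinate-wise addition in F_2^n."""
    return tuple((a + b) % 2 for a, b in zip(x, y))

def chi(S, x):
    """chi_S(x) = (-1)^(sum of x_i over i in S)."""
    return -1 if sum(x[i] for i in S) % 2 else 1

def hat(f, S):
    """Fourier coefficient E[f(x) chi_S(x)] of a function stored as a dict."""
    return Fraction(sum(f[x] * chi(S, x) for x in cube), len(cube))

def convolve(f, g):
    """(f * g)(x) = E_y[f(y) g(x + y)]."""
    return {x: Fraction(sum(f[y] * g[add(x, y)] for y in cube), len(cube)) for x in cube}

# Two arbitrary integer-valued test functions on F_2^3.
f = {x: 1 + 2 * x[0] - x[1] * x[2] for x in cube}
g = {x: (-1) ** (x[0] + x[1]) + 3 * x[2] for x in cube}

h = convolve(f, g)
theorem_holds = all(hat(h, S) == hat(f, S) * hat(g, S) for S in subsets)
print(theorem_holds)
```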


1.6. Highlight: Almost Linear Functions and the BLR Test 


In linear algebra there are two equivalent definitions of what it means for a 
function to be linear: 


Definition 1.28. A function $f : \mathbb{F}_2^n \to \mathbb{F}_2$ is linear if either of the following
equivalent conditions hold:


(1) $f(x + y) = f(x) + f(y)$ for all $x, y \in \mathbb{F}_2^n$;
(2) $f(x) = a \cdot x$ for some $a \in \mathbb{F}_2^n$; i.e., $f(x) = \sum_{i \in S} x_i$ for some $S \subseteq [n]$.


Exercise 1.26 asks you to verify that the conditions are indeed equivalent. If we
encode the output of $f$ by $\pm 1 \in \mathbb{R}$ in the usual way then the “linear” functions
$f : \mathbb{F}_2^n \to \{-1,1\}$ are precisely the $2^n$ parity functions $(\chi_S)_{S \subseteq [n]}$.

Let’s think of what it might mean for a function $f : \mathbb{F}_2^n \to \mathbb{F}_2$ to be
approximately linear. Definition 1.28 suggests two possibilities:


(1′) $f(x + y) = f(x) + f(y)$ for almost all pairs $x, y \in \mathbb{F}_2^n$;
(2′) there is some $S \subseteq [n]$ such that $f(x) = \sum_{i \in S} x_i$ for almost all $x \in \mathbb{F}_2^n$.


Are these equivalent? The proof of (2) $\Rightarrow$ (1) in Definition 1.28 is “robust”:
it easily extends to show (2′) $\Rightarrow$ (1′) (see Exercise 1.26). But the natural
proof of (1) $\Rightarrow$ (2) in Definition 1.28 does not have this robustness
property. The goal of this section is to show that (1′) $\Rightarrow$ (2′) nevertheless
holds.

Motivation for this problem comes from an area of theoretical computer
science called property testing, which we will discuss in more detail in Chapter 7.
Imagine that you have “black-box” access to a function $f : \mathbb{F}_2^n \to \mathbb{F}_2$, meaning
that the function $f$ is unknown to you but you can “query” its value on inputs
$x \in \mathbb{F}_2^n$ of your choosing. The function $f$ is “supposed” to be a linear function,
and you would like to try to verify this.

The only way you can be certain $f$ is indeed a linear function is to query
its value on all $2^n$ inputs; unfortunately, this is very expensive. The idea behind
“property testing” is to try to verify that $f$ has a certain property (in this case,
linearity) by querying its value on just a few random inputs. In exchange for
efficiency, we need to be willing to only approximately verify the property.


Definition 1.29. If $f$ and $g$ are Boolean-valued functions we say they are
$\epsilon$-close if $\mathrm{dist}(f, g) \le \epsilon$; otherwise we say they are $\epsilon$-far. If $\mathcal{P}$ is a (nonempty)
property of $n$-bit Boolean functions we define $\mathrm{dist}(f, \mathcal{P}) = \min_{g \in \mathcal{P}}\{\mathrm{dist}(f, g)\}$.
We say that $f$ is $\epsilon$-close to $\mathcal{P}$ if $\mathrm{dist}(f, \mathcal{P}) \le \epsilon$; i.e., $f$ is $\epsilon$-close to some $g$
satisfying $\mathcal{P}$.


In particular, in property testing we take property (2′) above to be the notion
of “approximately linear”: we say $f$ is $\epsilon$-close to being linear if $\mathrm{dist}(f, g) \le \epsilon$
for some truly linear $g(x) = \sum_{i \in S} x_i$.

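In 1990 Blum, Luby, and Rubinfeld (Blum et al., 1990) showed that indeed
(1′) $\Rightarrow$ (2′) holds, giving the following “test” for the property of linearity
that makes just 3 queries:


BLR Test. Given query access to $f : \mathbb{F}_2^n \to \mathbb{F}_2$:


• Choose $\boldsymbol{x} \sim \mathbb{F}_2^n$ and $\boldsymbol{y} \sim \mathbb{F}_2^n$ independently.
• Query $f$ at $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{x} + \boldsymbol{y}$.
• “Accept” if $f(\boldsymbol{x}) + f(\boldsymbol{y}) = f(\boldsymbol{x} + \boldsymbol{y})$.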
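
The three steps of the BLR Test translate almost directly into code. The sketch below is one possible rendering, assuming the black box is given as a Python callable on 0/1 tuples; `blr_test`, `acceptance_rate`, and the two example functions are hypothetical names, and repeating the test to estimate an acceptance rate is an illustrative addition rather than part of the test itself.

```python
import random

def blr_test(f, n, rng=random):
    """One run of the 3-query BLR test; f maps a tuple in {0,1}^n to a bit."""
    x = tuple(rng.randrange(2) for _ in range(n))
    y = tuple(rng.randrange(2) for _ in range(n))
    x_plus_y = tuple((a + b) % 2 for a, b in zip(x, y))
    # Accept iff f(x) + f(y) = f(x + y), with arithmetic in F_2.
    return (f(x) + f(y)) % 2 == f(x_plus_y)

def acceptance_rate(f, n, trials=2000, seed=0):
    """Empirical fraction of accepting runs (a seeded estimate, not exact)."""
    rng = random.Random(seed)
    return sum(blr_test(f, n, rng) for _ in range(trials)) / trials

parity_13 = lambda x: (x[0] + x[2]) % 2   # linear: sum of coordinates 1 and 3
and_12 = lambda x: x[0] * x[1]            # not linear

print(acceptance_rate(parity_13, 4))
print(acceptance_rate(and_12, 4))
```

A truly linear function is accepted on every run. For the AND example the test rejects exactly when $x_1 y_2 + x_2 y_1 = 1$ in $\mathbb{F}_2$, so its true acceptance probability is 5/8.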


We now show that if the BLR Test accepts $f$ with high probability then $f$
is close to being linear. The proof works by directly relating the acceptance
probability to the quantity $\sum_S \widehat{f}(S)^3$; see equation (1.10) below.


Theorem 1.30. Suppose the BLR Test accepts $f : \mathbb{F}_2^n \to \mathbb{F}_2$ with probability
$1 - \epsilon$. Then $f$ is $\epsilon$-close to being linear.


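Proof. In order to use the Fourier transform we encode $f$’s output by $\pm 1 \in \mathbb{R}$;
thus the acceptance condition of the BLR Test becomes $f(\boldsymbol{x})f(\boldsymbol{y}) = f(\boldsymbol{x} + \boldsymbol{y})$.
Since

$$\tfrac12 + \tfrac12 f(\boldsymbol{x})f(\boldsymbol{y})f(\boldsymbol{x} + \boldsymbol{y}) = \begin{cases} 1 & \text{if } f(\boldsymbol{x})f(\boldsymbol{y}) = f(\boldsymbol{x} + \boldsymbol{y}), \\ 0 & \text{if } f(\boldsymbol{x})f(\boldsymbol{y}) \neq f(\boldsymbol{x} + \boldsymbol{y}), \end{cases}$$

we conclude

$$\begin{aligned}
1 - \epsilon = \mathbf{Pr}[\text{BLR accepts } f] &= \mathop{\mathbf{E}}_{\boldsymbol{x}, \boldsymbol{y}}\bigl[\tfrac12 + \tfrac12 f(\boldsymbol{x})f(\boldsymbol{y})f(\boldsymbol{x} + \boldsymbol{y})\bigr] \\
&= \tfrac12 + \tfrac12 \mathop{\mathbf{E}}_{\boldsymbol{x}}\bigl[f(\boldsymbol{x}) \cdot \mathop{\mathbf{E}}_{\boldsymbol{y}}[f(\boldsymbol{y})f(\boldsymbol{x} + \boldsymbol{y})]\bigr] \\
&= \tfrac12 + \tfrac12 \mathop{\mathbf{E}}_{\boldsymbol{x}}[f(\boldsymbol{x}) \cdot (f * f)(\boldsymbol{x})] && \text{(by definition)} \\
&= \tfrac12 + \tfrac12 \sum_{S \subseteq [n]} \widehat{f}(S)\,\widehat{f * f}(S) && \text{(Plancherel)} \\
&= \tfrac12 + \tfrac12 \sum_{S \subseteq [n]} \widehat{f}(S)^3 && \text{(Theorem 1.27)}.
\end{aligned}$$

We rearrange this equality and then continue:

$$\begin{aligned}
1 - 2\epsilon &= \sum_{S \subseteq [n]} \widehat{f}(S)^3 && (1.10) \\
&\le \max_{S \subseteq [n]}\{\widehat{f}(S)\} \cdot \sum_{S \subseteq [n]} \widehat{f}(S)^2 \\
&= \max_{S \subseteq [n]}\{\widehat{f}(S)\} && \text{(Parseval)}.
\end{aligned}$$

But $\widehat{f}(S) = \langle f, \chi_S \rangle = 1 - 2\,\mathrm{dist}(f, \chi_S)$ (Proposition 1.9). Hence there exists
some $S^* \subseteq [n]$ such that $1 - 2\epsilon \le 1 - 2\,\mathrm{dist}(f, \chi_{S^*})$, i.e., $f$ is $\epsilon$-close to the
linear function $\chi_{S^*}$.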
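
The identity behind (1.10) can be verified exactly on a small example by enumerating all pairs $(x, y)$. The sketch below does this for $\mathrm{Maj}_3$, using the $\{-1,1\}^n$ representation so that $x + y$ becomes entry-wise multiplication; the choice of $\mathrm{Maj}_3$ and the helper names `mul` and `hat` are assumptions of the illustration.

```python
from fractions import Fraction
from itertools import combinations, product

n = 3
cube = list(product([1, -1], repeat=n))
f = lambda x: 1 if sum(x) > 0 else -1   # Maj_3, used only as an example

def mul(x, y):
    """Entry-wise product, the +-1 analogue of addition in F_2^n."""
    return tuple(a * b for a, b in zip(x, y))

# Exact acceptance probability: fraction of pairs with f(x) f(y) = f(x o y).
accept = Fraction(
    sum(f(x) * f(y) == f(mul(x, y)) for x in cube for y in cube),
    len(cube) ** 2,
)

def hat(S):
    """Fourier coefficient of f on S (0-indexed coordinates)."""
    total = 0
    for x in cube:
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return Fraction(total, len(cube))

all_subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
sum_of_cubes = sum(hat(S) ** 3 for S in all_subsets)
formula = Fraction(1, 2) + sum_of_cubes / 2
largest = max(hat(S) for S in all_subsets)
print(accept, formula, largest)
```

Here $\epsilon = 1 - \texttt{accept}$, and the largest coefficient is at least $1 - 2\epsilon$, as the proof requires.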


In fact, for small $\epsilon$ one can show that $f$ is more like $(\epsilon/3)$-close to linear,
and this is sharp. See Exercise 1.28.

The BLR Test shows that given black-box access to $f : \mathbb{F}_2^n \to \{-1,1\}$, we
can “test” whether $f$ is close to some linear function $\chi_S$ using just 3 queries. The
test does not reveal which linear function $\chi_S$ $f$ is close to (indeed, determining this
takes at least $n$ queries; see Exercise 1.27). Nevertheless, we can still determine
the value of $\chi_S(x)$ with high probability for every $x \in \mathbb{F}_2^n$ of our choosing using
just 2 queries. This property is called local correctability of linear functions.


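Proposition 1.31. Suppose $f : \mathbb{F}_2^n \to \{-1,1\}$ is $\epsilon$-close to the linear
function $\chi_S$. Then for every $x \in \mathbb{F}_2^n$, the following algorithm outputs $\chi_S(x)$ with
probability at least $1 - 2\epsilon$:


• Choose $\boldsymbol{y} \sim \mathbb{F}_2^n$.
• Query $f$ at $\boldsymbol{y}$ and $x + \boldsymbol{y}$.
• Output $f(\boldsymbol{y})f(x + \boldsymbol{y})$.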
We emphasize the order of quantifiers here: if we just output $f(x)$ then this
will equal $\chi_S(x)$ for most $x$; however, the above “local correcting” algorithm
determines $\chi_S(x)$ (with high probability) for every $x$.


Proof. Since $\boldsymbol{y}$ and $x + \boldsymbol{y}$ are both uniformly distributed on $\mathbb{F}_2^n$ (though not
independently) we have $\mathbf{Pr}[f(\boldsymbol{y}) \neq \chi_S(\boldsymbol{y})] \le \epsilon$ and $\mathbf{Pr}[f(x + \boldsymbol{y}) \neq \chi_S(x + \boldsymbol{y})] \le \epsilon$
by assumption. By the union bound, the probability of either event
occurring is at most $2\epsilon$; when neither occurs,

$$f(\boldsymbol{y})f(x + \boldsymbol{y}) = \chi_S(\boldsymbol{y})\chi_S(x + \boldsymbol{y}) = \chi_S(x)$$

as desired.
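
The guarantee of Proposition 1.31 can be examined on a toy instance by computing, for each fixed $x$, the exact probability over $\boldsymbol{y}$ that the local corrector succeeds. In the sketch below the target set `S`, the corrupted points, and all helper names are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

n = 4
cube = list(product([0, 1], repeat=n))
S = (0, 1)                                  # hypothetical target parity (0-indexed)
corrupted = {(0, 0, 0, 0), (1, 0, 1, 1)}    # points where f disagrees with chi_S

def chi_S(x):
    return -1 if sum(x[i] for i in S) % 2 else 1

def f(x):
    """chi_S with its value flipped on the corrupted points."""
    return -chi_S(x) if x in corrupted else chi_S(x)

def add(x, y):
    return tuple((a + b) % 2 for a, b in zip(x, y))

def success_probability(x):
    """Exact probability over uniform y that f(y) f(x + y) equals chi_S(x)."""
    good = sum(f(y) * f(add(x, y)) == chi_S(x) for y in cube)
    return Fraction(good, len(cube))

eps = Fraction(len(corrupted), len(cube))   # dist(f, chi_S) = 2/16
worst_case = min(success_probability(x) for x in cube)
print(eps, worst_case)
```

Even at the corrupted point $x = (0,0,0,0)$, where $f(x)$ itself is wrong, the corrector recovers $\chi_S(x)$, which is the point of the remark about the order of quantifiers.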
1.1 Compute the Fourier expansions of the following functions:

    (a) $\min_2 : \{-1,1\}^2 \to \{-1,1\}$, the minimum function on 2 bits (also
        known as the logical OR function);

    (b) $\min_3 : \{-1,1\}^3 \to \{-1,1\}$ and $\max_3 : \{-1,1\}^3 \to \{-1,1\}$;

    (c) the indicator function $1_{\{a\}} : \mathbb{F}_2^n \to \{0,1\}$, where $a \in \mathbb{F}_2^n$;

    (d) the density function $\varphi_{\{a\}} : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$, where $a \in \mathbb{F}_2^n$;

    (e) the density function $\varphi_{\{a, a+e_i\}} : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$, where $a \in \mathbb{F}_2^n$ and
        $e_i = (0, \dots, 0, 1, 0, \dots, 0)$ with the 1 in the $i$th coordinate;

    (f) the density function corresponding to the product probability
        distribution on $\{-1,1\}^n$ in which each coordinate has mean $\rho \in [-1, 1]$;

    (g) the inner product mod 2 function $\mathrm{IP}_{2n} : \mathbb{F}_2^{2n} \to \{-1,1\}$, defined by
        $\mathrm{IP}_{2n}(x_1, \dots, x_n, y_1, \dots, y_n) = (-1)^{x \cdot y}$;

    (h) the equality function $\mathrm{Equ}_n : \{-1,1\}^n \to \{0,1\}$, defined by $\mathrm{Equ}_n(x) = 1$
        if and only if $x_1 = x_2 = \cdots = x_n$;

    (i) the not-all-equal function $\mathrm{NAE}_n : \{-1,1\}^n \to \{0,1\}$, defined by
        $\mathrm{NAE}_n(x) = 1$ if and only if the bits $x_1, \dots, x_n$ are not all equal;

    (j) the selection function $\mathrm{Sel} : \{-1,1\}^3 \to \{-1,1\}$, which outputs $x_2$ if
        $x_1 = -1$ and outputs $x_3$ if $x_1 = 1$;

    (k) $\mathrm{mod}_3 : \mathbb{F}_2^3 \to \{0,1\}$, which is 1 if and only if the number of 1’s in the
        input is divisible by 3;

    (l) $\mathrm{OXR} : \mathbb{F}_2^3 \to \{0,1\}$ defined by $\mathrm{OXR}(x_1, x_2, x_3) = x_1 \vee (x_2 \oplus x_3)$.
        Here $\vee$ denotes logical OR, $\oplus$ denotes logical XOR;

    (m) the sortedness function $\mathrm{Sort}_4 : \{-1,1\}^4 \to \{-1,1\}$, defined by
        $\mathrm{Sort}_4(x) = -1$ if and only if $x_1 \le x_2 \le x_3 \le x_4$ or $x_1 \ge x_2 \ge x_3 \ge x_4$;
    (n) the hemi-icosahedron function $\mathrm{HI} : \{-1,1\}^6 \to \{-1,1\}$ (also known
        as the Kushilevitz function), defined to be the number of facets labeled
        $(+1,+1,+1)$ in Figure 1.2, minus the number of facets labeled
        $(-1,-1,-1)$, modulo 3.

        Figure 1.2. The hemi-icosahedron

        (Hint: First compute the real multilinear interpolation of the analogue
        $\mathrm{HI} : \{0,1\}^6 \to \{0,1\}$.)

    (o) the majority functions $\mathrm{Maj}_5 : \{-1,1\}^5 \to \{-1,1\}$ and $\mathrm{Maj}_7 :
        \{-1,1\}^7 \to \{-1,1\}$;

    (p) the complete quadratic function $\mathrm{CQ}_n : \mathbb{F}_2^n \to \{-1,1\}$ defined by
        $\mathrm{CQ}_n(x) = \chi\bigl(\sum_{1 \le i < j \le n} x_i x_j\bigr)$. (Hint: Determine $\mathrm{CQ}_n(x)$ as a function
        of the number of 1’s in the input modulo 4. You’ll want to distinguish
        whether $n$ is even or odd.)

1.2 How many Boolean functions $f : \{-1,1\}^n \to \{-1,1\}$ have exactly 1
nonzero Fourier coefficient?

1.3 Let $f : \mathbb{F}_2^n \to \{0,1\}$ and suppose $\#\{x : f(x) = 1\}$ is odd. Prove that all
of $f$’s Fourier coefficients are nonzero.

1.4 Let $f : \{-1,1\}^n \to \mathbb{R}$ have Fourier expansion $f(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, x^S$.
Let $F : \mathbb{R}^n \to \mathbb{R}$ be the extension of $f$ which is also defined by
$F(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, x^S$. Show that if $\mu = (\mu_1, \dots, \mu_n) \in [-1,1]^n$ then

$F(\mu) = \mathbf{E}_{y}[f(y)]$,

where $y$ is the random string in $\{-1,1\}^n$ defined by having $\mathbf{E}[y_i] = \mu_i$
independently for all $i \in [n]$.

1.5 Prove that any $f : \{-1,1\}^n \to \{-1,1\}$ has at most one Fourier coefficient
with magnitude exceeding 1/2. Is this also true for any $f : \{-1,1\}^n \to \mathbb{R}$
with $\|f\|_2 = 1$?


1.6 Use Parseval’s Theorem to prove uniqueness of the Fourier expansion.

1.7 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a random function (i.e., each $f(x)$ is $\pm 1$
with probability 1/2, independently for all $x \in \{-1,1\}^n$). Show that for
each $S \subseteq [n]$, the random variable $\widehat{f}(S)$ has mean 0 and variance $2^{-n}$.
(Hint: Parseval.)

1.8 The (Boolean) dual of $f : \{-1,1\}^n \to \mathbb{R}$ is the function $f^\dagger$ defined by
$f^\dagger(x) = -f(-x)$. The function $f$ is said to be odd if it equals its dual;
equivalently, if $f(-x) = -f(x)$ for all $x$. The function $f$ is said to be
even if $f(-x) = f(x)$ for all $x$. Given any function $f : \{-1,1\}^n \to \mathbb{R}$,
its odd part is the function $f^{\mathrm{odd}} : \{-1,1\}^n \to \mathbb{R}$ defined by
$f^{\mathrm{odd}}(x) = (f(x) - f(-x))/2$, and its even part is the function
$f^{\mathrm{even}} : \{-1,1\}^n \to \mathbb{R}$ defined by $f^{\mathrm{even}}(x) = (f(x) + f(-x))/2$.

(a) Express $\widehat{f^\dagger}(S)$ in terms of $\widehat{f}(S)$.

(b) Verify that $f = f^{\mathrm{odd}} + f^{\mathrm{even}}$ and that $f$ is odd (respectively, even) if
and only if $f = f^{\mathrm{odd}}$ (respectively, $f = f^{\mathrm{even}}$).

(c) Show that

$f^{\mathrm{odd}} = \sum_{S \subseteq [n],\ |S| \text{ odd}} \widehat{f}(S)\, \chi_S, \qquad f^{\mathrm{even}} = \sum_{S \subseteq [n],\ |S| \text{ even}} \widehat{f}(S)\, \chi_S.$

1.9 In this problem we consider representing False, True as $0, 1 \in \mathbb{R}$.

(a) Using the interpolation method from Section 1.2, show that every
$f : \{\text{False}, \text{True}\}^n \to \{\text{False}, \text{True}\}$ can be represented as a real
multilinear polynomial

$q(x) = \sum_{S \subseteq [n]} c_S \prod_{i \in S} x_i$   (1.11)

“over $\{0,1\}$”, meaning mapping $\{0,1\}^n \to \{0,1\}$.

(b) Show that this representation is unique. (Hint: If $q$ as in (1.11) has at
least one nonzero coefficient, consider $q(a)$ where $a \in \{0,1\}^n$ is the
indicator vector of a minimal $S$ with $c_S \neq 0$.)

(c) Show that all coefficients $c_S$ in the representation (1.11) will be
integers in the range $[-2^n, 2^n]$.

(d) Let $f : \{\text{False}, \text{True}\}^n \to \{\text{False}, \text{True}\}$. Let $p(x)$ be $f$’s multilinear
representation when False, True are $1, -1 \in \mathbb{R}$ (i.e., $p$ is the Fourier
expansion of $f$) and let $q(x)$ be $f$’s multilinear representation when
False, True are $0, 1 \in \mathbb{R}$. Show that $q(x) = \tfrac12 - \tfrac12\, p(1 - 2x_1, \dots, 1 - 2x_n)$.

1.10 Let $f : \{-1,1\}^n \to \mathbb{R}$ be not identically 0. The (real) degree of $f$, denoted
$\deg(f)$, is defined to be the degree of its multilinear (Fourier) expansion;
i.e., $\max\{|S| : \widehat{f}(S) \neq 0\}$.

(a) Show that $\deg(f) = \deg(a + bf)$ for any $a, b \in \mathbb{R}$ (assuming $b \neq 0$,
$a + bf \neq 0$).

(b) Show that $\deg(f) \leq k$ if and only if $f$ is a real linear combination
of functions $g_1, \dots, g_s$, each of which depends on at most $k$ input
coordinates.

(c) Which functions in Exercise 1.1 have “nontrivial” degree? (Here
$f : \{-1,1\}^n \to \mathbb{R}$ has “nontrivial” degree if $\deg(f) < n$.)


1.11 Suppose that $f : \{-1,1\}^n \to \{-1,1\}$ has $\deg(f) = k \geq 1$.

(a) Show that $f$’s real multilinear representation over $\{0,1\}$ (see
Exercise 1.9), call it $q(x)$, also has $\deg(q) = k$.

(b) Using Exercise 1.9(c),(d), deduce that $f$’s Fourier spectrum is
“$2^{1-k}$-granular”, meaning each $\widehat{f}(S)$ is an integer multiple of $2^{1-k}$.

(c) Show that $\sum_{S \subseteq [n]} |\widehat{f}(S)| \leq 2^{k-1}$.

1.12 A Hadamard Matrix is any $N \times N$ real matrix with $\pm 1$ entries and
orthogonal rows. Particular examples are the Walsh-Hadamard Matrices
$H_N$, inductively defined for $N = 2^n$ as follows:

$H_1 = \begin{bmatrix} 1 \end{bmatrix}, \qquad H_{2^{n+1}} = \begin{bmatrix} H_{2^n} & H_{2^n} \\ H_{2^n} & -H_{2^n} \end{bmatrix}.$

(a) Let’s index the rows and columns of $H_{2^n}$ by the integers
$\{0, 1, 2, \dots, 2^n - 1\}$ rather than $[2^n]$. Further, let’s identify such an
integer $i$ with its binary expansion $(i_0, i_1, \dots, i_{n-1}) \in \mathbb{F}_2^n$, where $i_0$ is
the least significant bit and $i_{n-1}$ the most. For example, if $n = 3$, we
identify the index $i = 6$ with $(0, 1, 1)$. Now show that the $(\gamma, x)$ entry
of $H_{2^n}$ is $(-1)^{\gamma \cdot x}$.

(b) Show that if $f : \mathbb{F}_2^n \to \mathbb{R}$ is represented as a column vector in $\mathbb{R}^{2^n}$
(according to the indexing scheme from part (a)) then $2^{-n} H_{2^n} f = \widehat{f}$.
Here we think of $\widehat{f}$ as also being a function $\mathbb{F}_2^n \to \mathbb{R}$, identifying
subsets $S \subseteq \{0, 1, \dots, n-1\}$ with their indicator vectors.

(c) Show how to compute $H_{2^n} f$ using just $n 2^n$ additions and subtractions
(rather than $2^{2n}$ additions and subtractions as the usual matrix-vector
multiplication algorithm would require). This computation is called
the Fast Walsh-Hadamard Transform and is the method of choice for
computing the Fourier expansion of a generic function $f : \mathbb{F}_2^n \to \mathbb{R}$
when $n$ is large.

(d) Show that taking the Fourier transform is essentially an “involution”:
$\widehat{\widehat{f}} = 2^{-n} f$ (using the notations from part (b)).
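As an illustration of the transform described in this exercise, here is a minimal Python sketch of one standard "butterfly" arrangement. It is not taken from the book and is not meant as the intended solution; the names `fwht`, `fourier_coefficients`, and `hadamard_entry` are hypothetical, and the input list is assumed to be indexed by integers whose binary digits are the coordinates, least significant bit first.

```python
def fwht(values):
    """Return H_N applied to `values`, where N = len(values) is a power of 2.

    Each of the n passes performs N/2 butterflies (one addition and one
    subtraction each), so the total is n * 2^n additions/subtractions.
    """
    a = list(values)
    size = len(a)
    if size == 0 or size & (size - 1):
        raise ValueError("length must be a power of 2")
    h = 1
    while h < size:
        # Combine adjacent blocks of length h into blocks of length 2h.
        for start in range(0, size, 2 * h):
            for j in range(start, start + h):
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v
        h *= 2
    return a


def fourier_coefficients(values):
    """Return 2^{-n} H_{2^n} f; entry S is indexed by the indicator vector of S."""
    size = len(values)
    return [c / size for c in fwht(values)]


def hadamard_entry(gamma, x):
    """Entry (gamma, x) of H_{2^n}, namely (-1)^(gamma . x) over F_2."""
    return -1 if bin(gamma & x).count("1") % 2 else 1


# Parity of both bits on F_2^2, f(x) = (-1)^(x_0 + x_1), listed for x = 0, 1, 2, 3.
print(fourier_coefficients([1, -1, -1, 1]))  # [0.0, 0.0, 0.0, 1.0]
```

Applying `fwht` twice returns the input scaled by `len(values)`, which matches the "involution" statement once the normalization is taken into account.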


1.13 Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $0 < p \leq q < \infty$. Show that $\|f\|_p \leq \|f\|_q$.
(Hint: Use Jensen’s inequality with the convex function $t \mapsto t^{q/p}$.)
Extend the inequality to the case $q = \infty$, where $\|f\|_\infty$ is defined to
be $\max_{x \in \{-1,1\}^n} \{|f(x)|\}$.

1.14 Compute the mean and variance of each function from Exercise 1.1.

1.15 Let $f : \{-1,1\}^n \to \mathbb{R}$. Let $K \subseteq [n]$ and let $z \in \{-1,1\}^K$. Suppose
$g : \{-1,1\}^{[n] \setminus K} \to \mathbb{R}$ is the subfunction of $f$ gotten by restricting the
$K$-coordinates to be $z$. Show that $\mathbf{E}[g] = \sum_{T \subseteq K} \widehat{f}(T)\, z^T$.

1.16 If $f : \{-1,1\}^n \to \{-1,1\}$, show that $\mathbf{Var}[f] = 4 \cdot \mathrm{dist}(f, 1) \cdot \mathrm{dist}(f, -1)$.
Deduce Proposition 1.15.

1.17 Extend Fact 1.14 by proving the following: If $F$ is a $\{-1,1\}$-valued
random variable with mean $\mu$ then

$\mathbf{Var}[F] = \mathbf{E}[(F - \mu)^2] = \tfrac12\, \mathbf{E}[(F - F')^2] = 2\, \mathbf{Pr}[F \neq F'] = \mathbf{E}[|F - \mu|],$

where $F'$ is an independent copy of $F$.
1.18 For any $f : \{-1,1\}^n \to \mathbb{R}$, show that

$\langle f^{=k}, f^{=\ell} \rangle = \begin{cases} \mathbf{W}^{k}[f] & \text{if } k = \ell, \\ 0 & \text{if } k \neq \ell. \end{cases}$

1.19 Let $f : \{-1,1\}^n \to \{-1,1\}$.
(a) Suppose $\mathbf{W}^{1}[f] = 1$. Show that $f(x) = \pm \chi_S$ for some $|S| = 1$.
(b) Suppose $\mathbf{W}^{\leq 1}[f] = 1$. Show that $f$ depends on at most 1 input
coordinate.
(c) Suppose $\mathbf{W}^{\leq 2}[f] = 1$. Must $f$ depend on at most 2 input coordinates?
At most 3 input coordinates? What if we assume $\mathbf{W}^{2}[f] = 1$?

1.20 Let $f : \{-1,1\}^n \to \mathbb{R}$ satisfy $f = f^{=1}$. Show that
$\mathbf{Var}[f^2] = 2 \sum_{i \neq j} \widehat{f}(i)^2\, \widehat{f}(j)^2$.

1.21 Prove that there are no functions $f : \{-1,1\}^n \to \{-1,1\}$ with exactly 2
nonzero Fourier coefficients. What about exactly 3 nonzero Fourier
coefficients?

1.22 Verify Propositions 1.25 and 1.26.


1.23 In this exercise you will prove some basic facts about “distances” between
probability distributions. Let $\varphi$ and $\psi$ be probability densities on $\mathbb{F}_2^n$.

(a) Show that the total variation distance between $\varphi$ and $\psi$, defined by

$d_{\mathrm{TV}}(\varphi, \psi) = \max_{A \subseteq \mathbb{F}_2^n} \Bigl\{ \Bigl| \mathbf{Pr}_{y \sim \varphi}[y \in A] - \mathbf{Pr}_{y \sim \psi}[y \in A] \Bigr| \Bigr\},$

is equal to $\tfrac12 \|\varphi - \psi\|_1$.

(b) Show that the collision probability of $\varphi$, defined to be

$\mathbf{Pr}_{y, y' \sim \varphi \text{ independently}}[y = y'],$

is equal to $\|\varphi\|_2^2 / 2^n$.

(c) The $\chi^2$-distance of $\varphi$ from $\psi$ is defined by

$d_{\chi^2}(\varphi, \psi) = \mathbf{E}_{y \sim \psi}\Bigl[\Bigl(\frac{\varphi(y)}{\psi(y)} - 1\Bigr)^2\Bigr],$

assuming $\psi$ has full support. Show that the $\chi^2$-distance of $\varphi$ from
uniform is equal to $\mathbf{Var}[\varphi]$.

(d) Show that the total variation distance of $\varphi$ from uniform is at most
$\tfrac12 \sqrt{\mathbf{Var}[\varphi]}$.

1.24 Let $A \subseteq \{-1,1\}^n$ have “volume” $\delta$, meaning $\mathbf{E}[1_A] = \delta$. Suppose $\varphi$ is
a probability density supported on $A$, meaning $\varphi(x) = 0$ when $x \notin A$.
Show that $\|\varphi\|_2^2 \geq 1/\delta$ with equality if $\varphi = \varphi_A$, the uniform density
on $A$.

1.25 Show directly from the definition that the convolution operator is
associative and commutative.

1.26 Verify that (1) $\iff$ (2) in Definition 1.28.

1.27 Suppose an algorithm is given query access to a linear function
$f : \mathbb{F}_2^n \to \mathbb{F}_2$ and its task is to determine which linear function $f$ is.
Show that querying $f$ on $n$ inputs is necessary and sufficient.

1.28 (a) Generalize Exercise 1.5 as follows: Let $f : \mathbb{F}_2^n \to \{-1,1\}$ and
suppose that $\mathrm{dist}(f, \chi_{S^*}) = \delta$. Show that $|\widehat{f}(S)| \leq 2\delta$ for all $S \neq S^*$.
(Hint: Use the union bound.)

(b) Deduce that the BLR Test rejects $f$ with probability at least
$3\delta - 10\delta^2 + 8\delta^3$.

(c) Show that this lower bound cannot be improved to $c\delta - O(\delta^2)$ for
any $c > 3$.

1.29 (a) We call $f : \mathbb{F}_2^n \to \mathbb{F}_2$ an affine function if $f(x) = a \cdot x + b$ for some
$a \in \mathbb{F}_2^n$, $b \in \mathbb{F}_2$. Show that $f$ is affine if and only if
$f(x) + f(y) + f(z) = f(x + y + z)$ for all $x, y, z \in \mathbb{F}_2^n$.

(b) Let $f : \mathbb{F}_2^n \to \mathbb{R}$. Suppose we choose $x, y, z \sim \mathbb{F}_2^n$ independently and
uniformly. Show that $\mathbf{E}[f(x) f(y) f(z) f(x + y + z)] = \sum_S \widehat{f}(S)^4$.

(c) Give a 4-query test for a function $f : \mathbb{F}_2^n \to \mathbb{F}_2$ with the following
property: if the test accepts with probability $1 - \epsilon$ then $f$ is $\epsilon$-close
to being affine. All four query inputs should have the uniform
distribution on $\mathbb{F}_2^n$ (but of course need not be independent).

(d) Give an alternate 4-query test for being affine in which three of the
query inputs are uniformly distributed and the fourth is not random.
(Hint: Show that $f$ is affine if and only if $f(x) + f(y) + f(0) = f(x + y)$
for all $x, y \in \mathbb{F}_2^n$.)

1.30 Permutations $\pi \in S_n$ act on strings $x \in \{-1,1\}^n$ in the natural way:
$(x^\pi)_i = x_{\pi(i)}$. They also act on functions $f : \{-1,1\}^n \to \mathbb{R}$ via
$f^\pi(x) = f(x^\pi)$ for all $x \in \{-1,1\}^n$. We say that functions
$g, h : \{-1,1\}^n \to \{-1,1\}$ are (permutation-)isomorphic if $g = h^\pi$ for some
$\pi \in S_n$. We call $\mathrm{Aut}(f) = \{\pi \in S_n : f^\pi = f\}$ the
(permutation-)automorphism group of $f$.

(a) Show that $\widehat{f^\pi}(S) = \widehat{f}(\pi^{-1}(S))$ for all $S \subseteq [n]$.

For future reference, when we write $(\widehat{f}(S))_{|S| = k}$, we mean the sequence
of degree-$k$ Fourier coefficients of $f$, listed in lexicographic order of the
$k$-sets $S$.

Given complete truth tables of some $g$ and $h$ we might wish to
determine whether they are isomorphic. One way to do this would
be to define a canonical form $\mathrm{can}(f) : \{-1,1\}^n \to \{-1,1\}$ for each
$f : \{-1,1\}^n \to \{-1,1\}$, meaning that: (i) $\mathrm{can}(f)$ is isomorphic to $f$;
(ii) if $g$ is isomorphic to $h$ then $\mathrm{can}(g) = \mathrm{can}(h)$. Then we can determine
whether $g$ is isomorphic to $h$ by checking whether $\mathrm{can}(g) = \mathrm{can}(h)$. Here
is one possible way to define a canonical form for $f$:

1. Set $P_0 = S_n$.
2. For each $k = 1, 2, 3, \dots, n$,
3.    Define $P_k$ to be the set of all $\pi \in P_{k-1}$ that make the sequence
      $(\widehat{f^\pi}(S))_{|S| = k}$ maximal in lexicographic order on $\mathbb{R}^{\binom{n}{k}}$.
4. Let $\mathrm{can}(f) = f^\pi$ for (any) $\pi \in P_n$.

(b) Show that this is well-defined, meaning that $\mathrm{can}(f)$ is the same
function for any choice of $\pi \in P_n$.

(c) Show that $\mathrm{can}(f)$ is indeed a canonical form; i.e., it satisfies (i) and (ii)
above.

(d) Show that if $\widehat{f}(\{1\}), \dots, \widehat{f}(\{n\})$ are distinct numbers then $\mathrm{can}(f)$ can
be computed in $O(2^n)$ time.

(e) We could more generally consider $g, h : \{-1,1\}^n \to \{-1,1\}$ to be
isomorphic if $g(x) = h(\pm x_{\pi(1)}, \dots, \pm x_{\pi(n)})$ for some permutation $\pi$
on $[n]$ and some choice of signs. Extend the results of this exercise to
handle this definition.

Notes


The Fourier expansion for real-valued Boolean functions dates back to Walsh (Walsh,
1923) who introduced a complete orthonormal basis for $L^2([0,1])$ consisting
of $\pm 1$-valued functions, constant on dyadic intervals. Using the ordering
introduced by Paley (Paley, 1932), the $n$th Walsh basis function $w_n : [0,1] \to \{-1,1\}$ is
defined by $w_n(x) = \prod_{i=0}^{\infty} r_i(x)^{n_i}$, where $n = \sum_{i=0}^{\infty} n_i 2^i$ and $r_i(x)$ (the “$i$th Rademacher
function at $x$”) is defined to be $(-1)^{x_i}$, with $x = \sum_{i=0}^{\infty} x_i 2^{-(i+1)}$ for non-dyadic
$x \in [0,1]$. Walsh’s interest was in comparing and contrasting the properties of
this basis with the usual basis of trigonometric polynomials and also Haar’s basis
(Haar, 1910).

The first major study of the Walsh functions came in the remarkable paper of
Paley (Paley, 1932), which included strong results on the $L^p$-norms of truncations of
Walsh series. Sadly, Paley died in an avalanche one year later (at age 26) while skiing
near Banff. The next major development in the study of Walsh series was conceptual,
with Vilenkin (Vilenkin, 1947) and Fine (Fine, 1949) independently suggesting the more
natural viewpoint of the Walsh functions as characters of the discrete group $\mathbb{Z}_2^{\mathbb{N}}$. There
was significant subsequent work in the 1950s and 1960s, but it’s somewhat unnatural
from our point of view because it relies fundamentally on ordering the Rademacher
and Walsh functions according to binary expansions. Bonami (Bonami, 1968) and
Kiener (Kiener, 1969) seem to have been the first authors to take our viewpoint, treating
bits $x_1, x_2, x_3, \dots$ symmetrically and ordering Fourier characters $\chi_S$ according to $|S|$
rather than $\max(S)$. Bonami also obtained the first hypercontractivity result for the
Boolean cube. This proved to be a crucial tool for analysis of Boolean functions; see
Chapter 9. For an early survey on Walsh series, see Balashov and Rubinshtein (Balashov
and Rubinshtein, 1973).

Turning to Boolean functions and computer science, the idea of using Boolean
logic to study “switching functions” (as engineers originally called Boolean functions)
dates to the late 1930s and is usually credited to Nakashima (Nakashima, 1935),
Shannon (Shannon, 1937), and Shestakov (Shestakov, 1938). Muller (Muller, 1954b) seems
to be the first to have used Fourier coefficients in the study of Boolean functions; he
mentions computing them while classifying all functions $f : \{0,1\}^4 \to \{0,1\}$ up to
certain equivalences. The first publication devoted to Boolean Fourier coefficients was by
Ninomiya (Ninomiya, 1958), who expanded on Muller’s use of Fourier coefficients for
the classification of Boolean functions up to various isomorphisms. Golomb (Golomb,
1959) independently pursued the same project (his work is the content of Exercise 1.30);
he was also the first to recognize the connection to Walsh series. The use of
“Fourier-Walsh analysis” in the study of Boolean functions quickly became well known in the
early 1960s. Several symposia on applications of Walsh functions took place in the
early 1970s, with Lechner’s 1971 monograph (Lechner, 1971) and Karpovsky’s 1976
book (Karpovsky, 1976) becoming the standard references. However, the use of Boolean
analysis in theoretical computer science seemed to wane until 1988, when the
outstanding work of Kahn, Kalai, and Linial (Kahn et al., 1988) ushered in a new era of
sophistication.

The original analysis by Blum, Luby, and Rubinfeld (Blum et al., 1990) for their
linearity test was combinatorial; our proof of Theorem 1.30 is the elegant analytic one
due to Bellare, Coppersmith, Håstad, Kiwi, and Sudan (Bellare et al., 1996). In fact,
the essence of this analysis appears already in the 1953 work of Roth (Roth, 1953)
(in the context of the cyclic group $\mathbb{Z}_N$ rather than $\mathbb{F}_2^n$). The work of Bellare et al. also
gives additional analysis improving the results of Theorem 1.30 and Exercise 1.28. See
also the work of Kaufman, Litsyn, and Xie (Kaufman et al., 2010) for further slight
improvement.

In Exercise 1.1, the sortedness function was introduced by Ambainis (Ambainis, 
2003; Laplante et al., 2006); the hemi-icosahedron function was introduced by Kushile- 
vitz (Nisan and Wigderson, 1995). The fast algorithm for computing the Fourier trans- 
form mentioned in Exercise 1.12 is due to Lechner (Lechner, 1963). 


2 


Basic Concepts and Social Choice 


In this chapter we introduce a number of important basic concepts including 
influences and noise stability. Many of these concepts are nicely motivated 
using the language of social choice. The chapter is concluded with Kalai’s 
Fourier-based proof of Arrow’s Theorem. 


2.1. Social Choice Functions 


In this section we describe some rudiments of the mathematics of social choice, 
a topic studied by economists, political scientists, mathematicians, and com- 
puter scientists. The fundamental question in this area is how best to aggregate 
the opinions of many agents. Examples where this problem arises include cit- 
izens voting in an election, committees deciding on alternatives, and indepen- 
dent computational agents making collective decisions. Social choice theory 
also provides very appealing interpretations for a number of important functions 
and concepts in the analysis of Boolean functions. 

A Boolean function $f : \{-1,1\}^n \to \{-1,1\}$ can be thought of as a voting
rule or social choice function for an election with 2 candidates and $n$ voters;
it maps the votes of the voters to the winner of the election. Perhaps the most
familiar voting rule is the majority function:


Definition 2.1. For $n$ odd, the majority function $\mathrm{Maj}_n : \{-1,1\}^n \to \{-1,1\}$
is defined by $\mathrm{Maj}_n(x) = \mathrm{sgn}(x_1 + x_2 + \cdots + x_n)$. (Occasionally, for $n$ even
we say that $f$ is a majority function if $f(x)$ equals the sign of $x_1 + \cdots + x_n$
whenever this number is nonzero.)


The Boolean AND and OR functions correspond to voting rules in which a
certain candidate is always elected unless all voters are unanimously opposed.
Recalling our somewhat nonintuitive convention that $-1$ represents True and $+1$
represents False:

Definition 2.2. The function $\mathrm{AND}_n : \{-1,1\}^n \to \{-1,1\}$ is defined
by $\mathrm{AND}_n(x) = +1$ unless $x = (-1, -1, \dots, -1)$. The function
$\mathrm{OR}_n : \{-1,1\}^n \to \{-1,1\}$ is defined by $\mathrm{OR}_n(x) = -1$ unless
$x = (+1, +1, \dots, +1)$.


Another voting rule commonly encountered in practice: 


Definition 2.3. The $i$th dictator function $\chi_i : \{-1,1\}^n \to \{-1,1\}$ is defined
by $\chi_i(x) = x_i$.


Here we are simplifying notation for the singleton monomial from $\chi_{\{i\}}$ to
$\chi_i$. Even though they are extremely simple functions, the dictators play a very
important role in analysis of Boolean functions; to highlight this we prefer
the colorful terminology “dictator functions” to the more mathematically staid
“projection functions”. Generalizing:


Definition 2.4. A function $f : \{-1,1\}^n \to \{-1,1\}$ is called a $k$-junta for $k \in \mathbb{N}$
if it depends on at most $k$ of its input coordinates; i.e., $f(x) = g(x_{i_1}, \dots, x_{i_k})$
for some $g : \{-1,1\}^k \to \{-1,1\}$ and $i_1, \dots, i_k \in [n]$. Informally, we say that
$f$ is a “junta” if it depends on only a “constant” number of coordinates.


For example, the number of functions $f : \{-1,1\}^n \to \{-1,1\}$ which are
1-juntas is precisely $2n + 2$: the $n$ dictators, the $n$ negated-dictators, and the 2
constant functions $\pm 1$.

The European Union’s Council of Ministers adopts decisions based on a 
weighted majority voting rule: 


Definition 2.5. A function $f : \{-1,1\}^n \to \{-1,1\}$ is called a weighted majority
or (linear) threshold function if it is expressible as
$f(x) = \mathrm{sgn}(a_0 + a_1 x_1 + \cdots + a_n x_n)$ for some $a_0, a_1, \dots, a_n \in \mathbb{R}$.


Exercise 2.2 has you verify that majority, AND, OR, dictators, and constants 
are all linear threshold functions. 

The leader of the United States (and many other countries) is elected via a 
kind of “two-level majority”. We make a natural definition along these lines: 


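Definition 2.6. The depth-$d$ recursive majority of $n$ function, denoted $\mathrm{Maj}_n^{\otimes d}$,
is the Boolean function of $n^d$ bits defined inductively as follows: $\mathrm{Maj}_n^{\otimes 1} = \mathrm{Maj}_n$,
and $\mathrm{Maj}_n^{\otimes (d+1)}(x^{(1)}, \dots, x^{(n)}) = \mathrm{Maj}_n(\mathrm{Maj}_n^{\otimes d}(x^{(1)}), \dots, \mathrm{Maj}_n^{\otimes d}(x^{(n)}))$ for
$x^{(i)} \in \{-1,1\}^{n^d}$.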


In our last example of a 2-candidate voting rule, the voters are divided into
“tribes” of equal size and the outcome is True if and only if at least one tribe is
unanimously in favor of True. This rule is only somewhat plausible in practice,
but it plays a very important role in the analysis of Boolean functions:

Definition 2.7. The tribes function of width $w$ and size $s$,
$\mathrm{Tribes}_{w,s} : \{-1,1\}^{sw} \to \{-1,1\}$, is defined by
$\mathrm{Tribes}_{w,s}(x^{(1)}, \dots, x^{(s)}) = \mathrm{OR}_s(\mathrm{AND}_w(x^{(1)}), \dots, \mathrm{AND}_w(x^{(s)}))$, where $x^{(i)} \in \{-1,1\}^w$.
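The voting rules defined so far can be written down directly. The following Python sketch is only an illustration under the chapter's convention that $-1$ represents True; the function names are hypothetical, and votes are passed as tuples of $\pm 1$.

```python
def maj(x):
    """Majority of +-1 votes; requires a nonzero total (e.g. an odd number of votes)."""
    total = sum(x)
    if total == 0:
        raise ValueError("tie: majority is not defined on this input")
    return 1 if total > 0 else -1


def and_n(x):
    """AND with -1 = True: the output is -1 only when every input is -1."""
    return -1 if all(b == -1 for b in x) else 1


def or_n(x):
    """OR with -1 = True: the output is +1 only when every input is +1."""
    return 1 if all(b == 1 for b in x) else -1


def tribes(w, s, x):
    """Tribes of width w and size s: OR over s blocks of the AND of each width-w block."""
    if len(x) != w * s:
        raise ValueError("expected w * s votes")
    return or_n([and_n(x[j * w:(j + 1) * w]) for j in range(s)])


def recursive_maj(n, d, x):
    """Depth-d recursive majority of n, evaluated on n**d votes."""
    if len(x) != n ** d:
        raise ValueError("expected n ** d votes")
    if d == 1:
        return maj(x)
    block = n ** (d - 1)
    return maj([recursive_maj(n, d - 1, x[j * block:(j + 1) * block]) for j in range(n)])
```

For instance, `tribes(2, 2, (-1, -1, 1, 1))` evaluates to `-1` (True), because the first tribe is unanimously in favor of True.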


Here are some natural properties of 2-candidate social choice functions 
which may be considered desirable: 


Definition 2.8. We say that a function $f : \{-1,1\}^n \to \{-1,1\}$ is:

• monotone if $f(x) \leq f(y)$ whenever $x \leq y$ coordinate-wise;

• odd if $f(-x) = -f(x)$;

• unanimous if $f(1, \dots, 1) = 1$ and $f(-1, \dots, -1) = -1$;

• symmetric if $f(x^\pi) = f(x)$ for all permutations $\pi \in S_n$ (using the notation
from Exercise 1.30); i.e., $f(x)$ only depends on the number of 1’s in $x$.


The definitions of monotone, odd, and symmetric are also natural for
$f : \{-1,1\}^n \to \mathbb{R}$.


Example 2.9. The majority function (for n odd) has all four properties in 
Definition 2.8; indeed, May’s Theorem (Exercise 2.3) states that it is the only 
monotone, odd, symmetric function. The dictator functions have the first three 
properties above, as do recursive majority functions. The AND and OR func- 
tions are monotone, unanimous, and symmetric, but not odd. The tribes func- 
tions are monotone and unanimous; although they are not symmetric they have 
an important weaker property: 


Definition 2.10. A function $f : \{-1,1\}^n \to \{-1,1\}$ is transitive-symmetric if
for all $i, i' \in [n]$ there exists a permutation $\pi \in S_n$ taking $i$ to $i'$ such that
$f(x^\pi) = f(x)$ for all $x \in \{-1,1\}^n$.


Intuitively, a function is transitive-symmetric if any two coordinates $i, j \in [n]$
are “equivalent”.

One more natural desirable property of a 2-candidate voting rule is that it be
unbiased as defined in Chapter 1.4, i.e., “equally likely” to elect $\pm 1$. Of course,
this presupposes the uniform probability distribution on votes.


Definition 2.11. The impartial culture assumption is that the n voters’ prefer- 
ences are independent and uniformly random. 


Although this assumption might seem somewhat unrealistic, it gives a good
basis for comparing voting rules in the absence of other information. One
might also consider it as a model for the votes of just the “undecided” or
“party-independent” voters.

[Figure 2.1. Boundary edges of the $\mathrm{Maj}_3$ function]


2.2. Influences and Derivatives 


Given a voting rule $f : \{-1,1\}^n \to \{-1,1\}$ it’s natural to try to measure the
“influence” or “power” of the $i$th voter. One can define this to be the “probability
that the $i$th vote affects the outcome”.


Definition 2.12. We say that coordinate $i \in [n]$ is pivotal for
$f : \{-1,1\}^n \to \{-1,1\}$ on input $x$ if $f(x) \neq f(x^{\oplus i})$. Here we have used the
notation $x^{\oplus i}$ for the string $(x_1, \dots, x_{i-1}, -x_i, x_{i+1}, \dots, x_n)$.


Definition 2.13. The influence of coordinate $i$ on $f : \{-1,1\}^n \to \{-1,1\}$ is
defined to be the probability that $i$ is pivotal for a random input:

$\mathbf{Inf}_i[f] = \mathbf{Pr}_{x \sim \{-1,1\}^n}[f(x) \neq f(x^{\oplus i})].$


Influences can be equivalently defined in terms of “geometry” of the Ham- 
ming cube: 


Fact 2.14. For $f : \{-1,1\}^n \to \{-1,1\}$, the influence $\mathbf{Inf}_i[f]$ equals the
fraction of dimension-$i$ edges in the Hamming cube which are boundary edges. Here
$(x, y)$ is a dimension-$i$ edge if $y = x^{\oplus i}$; it is a boundary edge if $f(x) \neq f(y)$.


Example 2.15. For the $i$th dictator function $\chi_i$ we have that coordinate $i$ is
pivotal for every input $x$; hence $\mathbf{Inf}_i[\chi_i] = 1$. On the other hand, if $j \neq i$
then coordinate $j$ is never pivotal; hence $\mathbf{Inf}_j[\chi_i] = 0$ for $j \neq i$. Note that
the same two statements are true about the negated-dictator functions. For the
constant functions $\pm 1$, all influences are 0. For the $\mathrm{OR}_n$ function, coordinate 1
is pivotal for exactly two inputs, $(-1, 1, 1, \dots, 1)$ and $(1, 1, 1, \dots, 1)$; hence
$\mathbf{Inf}_1[\mathrm{OR}_n] = 2^{1-n}$. Similarly, $\mathbf{Inf}_i[\mathrm{OR}_n] = \mathbf{Inf}_i[\mathrm{AND}_n] = 2^{1-n}$ for all $i \in [n]$.
The $\mathrm{Maj}_3$ is depicted in Figure 2.1; the points where it’s $+1$ are colored gray and
the points where it’s $-1$ are colored white. Its boundary edges are highlighted
in black; there are 2 of them in each of the 3 dimensions. Since there are 4
total edges in each dimension, we conclude $\mathbf{Inf}_i[\mathrm{Maj}_3] = 2/4 = 1/2$ for all
$i \in [3]$. For majority in higher dimensions, $\mathbf{Inf}_i[\mathrm{Maj}_n]$ equals the probability
that among $n - 1$ random bits, exactly half of them are 1. This is roughly
$\sqrt{2/\pi}/\sqrt{n}$ for large $n$; see Exercise 2.22 or Chapter 5.2.
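The influences in Example 2.15 can be checked by brute-force enumeration for small $n$. The Python sketch below is only illustrative: the helper names are hypothetical, coordinates are 0-based (so coordinate 1 of the text is index 0), and the running time is exponential in $n$.

```python
from itertools import product


def flip(x, i):
    """Return x with coordinate i negated (0-based index)."""
    return x[:i] + (-x[i],) + x[i + 1:]


def influence(f, n, i):
    """Fraction of inputs x in {-1,1}^n on which coordinate i is pivotal for f."""
    pivotal = sum(1 for x in product((-1, 1), repeat=n) if f(x) != f(flip(x, i)))
    return pivotal / 2 ** n


def maj(x):
    # Majority of +-1 votes; assumes an odd number of votes, so there are no ties.
    return 1 if sum(x) > 0 else -1


def or_n(x):
    # OR with -1 = True: the output is +1 only when every input is +1.
    return 1 if all(b == 1 for b in x) else -1


def dictator(x):
    # The dictator function on the first coordinate.
    return x[0]


print([influence(maj, 3, i) for i in range(3)])              # [0.5, 0.5, 0.5]
print(influence(or_n, 4, 0))                                  # 0.125, i.e. 2^(1-4)
print(influence(dictator, 4, 0), influence(dictator, 4, 1))   # 1.0 0.0
```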


Influences can also be defined more “analytically” by introducing the deriva- 
tive operators. 


Definition 2.16. The $i$th (discrete) derivative operator $D_i$ maps the function
$f : \{-1,1\}^n \to \mathbb{R}$ to the function $D_i f : \{-1,1\}^n \to \mathbb{R}$ defined by

$D_i f(x) = \dfrac{f(x^{(i \mapsto 1)}) - f(x^{(i \mapsto -1)})}{2}.$

Here we have used the notation $x^{(i \mapsto b)} = (x_1, \dots, x_{i-1}, b, x_{i+1}, \dots, x_n)$. Notice
that $D_i f(x)$ does not actually depend on $x_i$. The operator $D_i$ is a linear operator:
i.e., $D_i(f + g) = D_i f + D_i g$.


If $f : \{-1,1\}^n \to \{-1,1\}$ is Boolean-valued then

$D_i f(x) = \begin{cases} 0 & \text{if coordinate } i \text{ is not pivotal for } x, \\ \pm 1 & \text{if coordinate } i \text{ is pivotal for } x. \end{cases}$   (2.1)

Thus $D_i f(x)^2$ is the 0-1 indicator for whether $i$ is pivotal for $x$ and we
conclude that $\mathbf{Inf}_i[f] = \mathbf{E}[D_i f(x)^2]$. We take this formula as a definition for the
influences of real-valued Boolean functions.


Definition 2.17. We generalize Definition 2.13 to functions $f : \{-1,1\}^n \to \mathbb{R}$
by defining the influence of coordinate $i$ on $f$ to be

$\mathbf{Inf}_i[f] = \mathbf{E}_{x \sim \{-1,1\}^n}[D_i f(x)^2] = \|D_i f\|_2^2.$


Definition 2.18. We say that coordinate $i \in [n]$ is relevant for
$f : \{-1,1\}^n \to \mathbb{R}$ if and only if $\mathbf{Inf}_i[f] > 0$; i.e., $f(x^{(i \mapsto 1)}) \neq f(x^{(i \mapsto -1)})$ for at
least one $x \in \{-1,1\}^n$.


The discrete derivative operators are quite analogous to the usual partial
derivatives. For example, $f : \{-1,1\}^n \to \mathbb{R}$ is monotone if and only if
$D_i f(x) \geq 0$ for all $i$ and $x$. Further, $D_i$ acts like formal differentiation on
Fourier expansions:


Proposition 2.19. Let $f : \{-1,1\}^n \to \mathbb{R}$ have the multilinear expansion
$f(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, x^S$. Then

$D_i f(x) = \sum_{S \subseteq [n],\ S \ni i} \widehat{f}(S)\, x^{S \setminus \{i\}}.$   (2.2)


Proof. Since $D_i$ is a linear operator, the claim follows immediately from the
observation that

$D_i x^S = \begin{cases} x^{S \setminus \{i\}} & \text{if } i \in S, \\ 0 & \text{if } i \notin S. \end{cases}$


By applying Parseval’s Theorem to the Fourier expansion (2.2), we obtain a
Fourier formula for influences:


Theorem 2.20. For $f : \{-1,1\}^n \to \mathbb{R}$ and $i \in [n]$,

$\mathbf{Inf}_i[f] = \sum_{S \ni i} \widehat{f}(S)^2.$

In other words, the influence of coordinate $i$ on $f$ equals the sum of $f$’s
Fourier weights on sets containing $i$. This is another good example of being
able to “read off” an interesting combinatorial property of a Boolean function
from its Fourier expansion. In the special case that $f : \{-1,1\}^n \to \{-1,1\}$ is
monotone there is a much simpler way to read off its influences: they are the
degree-1 Fourier coefficients. In what follows, we write $\widehat{f}(i)$ in place of $\widehat{f}(\{i\})$.
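To see the Fourier formula for influences in action, the following Python sketch compares it with the definition $\mathbf{E}[D_i f(x)^2]$ on two small functions. It is a brute-force illustration only; the helper names are hypothetical, sets $S$ are given as tuples of 0-based indices, and `sel` is the selection function from Exercise 1.1(j).

```python
from itertools import combinations, product


def fourier_coefficient(f, n, S):
    """f-hat(S) = E[f(x) * prod_{i in S} x_i] over uniform x in {-1,1}^n."""
    total = 0
    for x in product((-1, 1), repeat=n):
        chi = 1
        for i in S:
            chi *= x[i]
        total += f(x) * chi
    return total / 2 ** n


def influence_fourier(f, n, i):
    """Sum of f-hat(S)^2 over all sets S containing i."""
    return sum(
        fourier_coefficient(f, n, S) ** 2
        for k in range(1, n + 1)
        for S in combinations(range(n), k)
        if i in S
    )


def influence_direct(f, n, i):
    """E[D_i f(x)^2], computed straight from the derivative operator."""
    total = 0
    for x in product((-1, 1), repeat=n):
        plus = f(x[:i] + (1,) + x[i + 1:])
        minus = f(x[:i] + (-1,) + x[i + 1:])
        total += ((plus - minus) / 2) ** 2
    return total / 2 ** n


def sel(x):
    # Selection function: outputs x_2 if x_1 = -1 and x_3 if x_1 = 1 (not monotone).
    return x[1] if x[0] == -1 else x[2]


def maj3(x):
    # Majority on 3 bits (monotone).
    return 1 if sum(x) > 0 else -1
```

For the monotone function `maj3`, each degree-1 coefficient `fourier_coefficient(maj3, 3, (i,))` equals `influence_direct(maj3, 3, i)`, while for the non-monotone `sel` the two computations of the influence still agree but the degree-1 coefficients alone do not determine it.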


Proposition 2.21. If $f : \{-1,1\}^n \to \{-1,1\}$ is monotone, then $\mathbf{Inf}_i[f] = \widehat{f}(i)$.

Proof. By monotonicity, the $\pm 1$ in (2.1) is always 1; i.e., $D_i f(x)$ is the 0-1
indicator that $i$ is pivotal for $x$. Hence $\mathbf{Inf}_i[f] = \mathbf{E}[D_i f] = \widehat{D_i f}(\emptyset) = \widehat{f}(i)$,
where the third equality used Proposition 2.19.


This formula allows us a neat proof that for any 2-candidate voting rule that 
is monotone and transitive-symmetric, all of the voters have small influence: 


Proposition 2.22. Let $f : \{-1,1\}^n \to \{-1,1\}$ be transitive-symmetric and
monotone. Then $\mathbf{Inf}_i[f] \leq 1/\sqrt{n}$ for all $i \in [n]$.


Proof. Transitive-symmetry of $f$ implies that $\widehat{f}(i) = \widehat{f}(i')$ for all $i, i' \in [n]$
(using Exercise 1.30(a)); thus by monotonicity, $\mathbf{Inf}_i[f] = \widehat{f}(i) = \widehat{f}(1)$ for
all $i \in [n]$. But by Parseval, $1 = \sum_S \widehat{f}(S)^2 \geq \sum_{i=1}^n \widehat{f}(i)^2 = n \widehat{f}(1)^2$; hence
$\widehat{f}(1) \leq 1/\sqrt{n}$.


This bound is slightly improved in Proposition 2.58 and Exercise 2.24. 

The derivative operators are very convenient for functions defined on
$\{-1,1\}^n$ but they are less natural if we think of the Hamming cube as
$\{\text{True}, \text{False}\}^n$; for the more general domains we’ll look at in Chapter 8 they don’t
even make sense. We end this section by introducing some useful definitions
that will generalize better later.

Definition 2.23. The $i$th expectation operator $E_i$ is the linear operator on
functions $f : \{-1,1\}^n \to \mathbb{R}$ defined by

$E_i f(x) = \mathbf{E}_{\boldsymbol{x}_i}[f(x_1, \dots, x_{i-1}, \boldsymbol{x}_i, x_{i+1}, \dots, x_n)].$


Whereas $D_i f$ isolates the part of $f$ depending on the $i$th coordinate, $E_i f$
isolates the part not depending on the $i$th coordinate. Exercise 2.15 asks you to
verify the following:


Proposition 2.24. For $f : \{-1,1\}^n \to \mathbb{R}$,

• $E_i f(x) = \dfrac{f(x^{(i \mapsto 1)}) + f(x^{(i \mapsto -1)})}{2}$,

• $E_i f(x) = \sum_{S \not\ni i} \widehat{f}(S)\, x^S$,

• $f(x) = x_i D_i f(x) + E_i f(x)$.


Note that in the decomposition $f = x_i D_i f + E_i f$, neither $D_i f$ nor $E_i f$
depends on $x_i$. This decomposition is very useful for proving facts about
Boolean functions by induction on $n$.
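The decomposition $f = x_i D_i f + E_i f$ is easy to check numerically. The Python sketch below does so for one arbitrarily chosen real-valued function; the helper names and the function `example` are hypothetical and serve only as an illustration, with 0-based coordinate indices.

```python
from itertools import product


def set_coordinate(x, i, b):
    """Return x with coordinate i replaced by b (0-based index)."""
    return x[:i] + (b,) + x[i + 1:]


def derivative(f, i):
    """The function D_i f: half the difference of f with x_i set to +1 and to -1."""
    return lambda x: (f(set_coordinate(x, i, 1)) - f(set_coordinate(x, i, -1))) / 2


def expectation(f, i):
    """The function E_i f: the average of f with x_i set to +1 and to -1."""
    return lambda x: (f(set_coordinate(x, i, 1)) + f(set_coordinate(x, i, -1))) / 2


def example(x):
    # An arbitrary real-valued function on {-1,1}^3, used only for illustration.
    return x[0] * x[1] + 0.5 * x[2] - 2.0


for i in range(3):
    d_i, e_i = derivative(example, i), expectation(example, i)
    for x in product((-1, 1), repeat=3):
        # f(x) = x_i * D_i f(x) + E_i f(x) at every point of the cube.
        assert abs(example(x) - (x[i] * d_i(x) + e_i(x))) < 1e-12
```

Here `derivative(example, 0)` agrees with the function $x \mapsto x_2$ (index 1 in the code), which is what formal differentiation of the monomial $x_1 x_2$ with respect to $x_1$ suggests.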

Finally, we will also define an operator very similar to D; called the ith 
Laplacian: 


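Definition 2.25. The $i$th coordinate Laplacian operator $L_i$ is defined by
$L_i f = f - E_i f$.

Notational warning: Elsewhere you might see the negated definition, $E_i f - f$.
Exercise 2.16 asks you to verify the following:

Proposition 2.26. For $f : \{-1,1\}^n \to \mathbb{R}$,

• $L_i f(x) = \dfrac{f(x) - f(x^{\oplus i})}{2}$,

• $L_i f(x) = x_i D_i f(x) = \sum_{S \ni i} \widehat{f}(S)\, x^S$,

• $\langle f, L_i f \rangle = \langle L_i f, L_i f \rangle = \mathbf{Inf}_i[f]$.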


2.3. Total Influence 


A very important quantity in the analysis of a Boolean function is the sum of
its influences.

Definition 2.27. The total influence of $f : \{-1,1\}^n \to \mathbb{R}$ is defined to be

$\mathbf{I}[f] = \sum_{i=1}^n \mathbf{Inf}_i[f].$


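For Boolean-valued functions $f : \{-1,1\}^n \to \{-1,1\}$ the total influence
has several additional interpretations. First, it is often referred to as the average
sensitivity of $f$ because of the following proposition:


Proposition 2.28. For $f : \{-1,1\}^n \to \{-1,1\}$,

$\mathbf{I}[f] = \mathbf{E}_x[\mathrm{sens}_f(x)],$

where $\mathrm{sens}_f(x)$ is the sensitivity of $f$ at $x$, defined to be the number of pivotal
coordinates for $f$ on input $x$.


Proof.

$\mathbf{I}[f] = \sum_{i=1}^n \mathbf{Inf}_i[f] = \sum_{i=1}^n \mathbf{Pr}_x[f(x) \neq f(x^{\oplus i})] = \sum_{i=1}^n \mathbf{E}_x[1_{f(x) \neq f(x^{\oplus i})}] = \mathbf{E}_x\Bigl[\sum_{i=1}^n 1_{f(x) \neq f(x^{\oplus i})}\Bigr] = \mathbf{E}_x[\mathrm{sens}_f(x)].$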


The total influence of $f : \{-1,1\}^n \to \{-1,1\}$ is also closely related to the
size of its edge boundary; from Fact 2.14 we deduce:


Fact 2.29. The fraction of edges in the Hamming cube $\{-1,1\}^n$ which are
boundary edges for $f : \{-1,1\}^n \to \{-1,1\}$ is equal to $\tfrac{1}{n} \mathbf{I}[f]$.


Example 2.30. (Recall Example 2.15.) For Boolean-valued functions
$f : \{-1,1\}^n \to \{-1,1\}$ the total influence ranges between 0 and $n$. It is minimized
by the constant functions $\pm 1$ which have total influence 0. It is maximized by
the parity function $\chi_{[n]}$ and its negation which have total influence $n$; every
coordinate is pivotal on every input for these functions. The dictator functions
(and their negations) have total influence 1. The total influence of $\mathrm{OR}_n$ and
$\mathrm{AND}_n$ is very small: $n 2^{1-n}$. On the other hand, the total influence of $\mathrm{Maj}_n$ is
fairly large: roughly $\sqrt{2/\pi}\sqrt{n}$ for large $n$.
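The values quoted in Example 2.30 can be reproduced for small $n$ by averaging the sensitivity over all inputs, as in Proposition 2.28. The following Python sketch is a brute-force illustration with hypothetical helper names; it is only practical for small $n$, and `maj` assumes $n$ is odd.

```python
from itertools import product
from math import pi, sqrt


def flip(x, i):
    """Return x with coordinate i negated (0-based index)."""
    return x[:i] + (-x[i],) + x[i + 1:]


def sensitivity(f, x):
    """Number of coordinates that are pivotal for f on input x."""
    return sum(1 for i in range(len(x)) if f(x) != f(flip(x, i)))


def total_influence(f, n):
    """Average sensitivity of f over uniform x in {-1,1}^n."""
    return sum(sensitivity(f, x) for x in product((-1, 1), repeat=n)) / 2 ** n


def parity(x):
    result = 1
    for b in x:
        result *= b
    return result


def or_n(x):
    # OR with -1 = True: the output is +1 only when every input is +1.
    return 1 if all(b == 1 for b in x) else -1


def maj(x):
    # Majority of +-1 votes; assumes an odd number of votes.
    return 1 if sum(x) > 0 else -1


print(total_influence(parity, 4))          # 4.0
print(total_influence(or_n, 4))            # 0.5, i.e. 4 * 2^(1-4)
print(total_influence(lambda x: x[0], 4))  # 1.0
print(total_influence(maj, 11), sqrt(2 / pi) * sqrt(11))
```

The last line prints the exact total influence of $\mathrm{Maj}_{11}$ next to the estimate $\sqrt{2/\pi}\sqrt{11}$; the two differ by a few percent at this size.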


By virtue of Proposition 2.21 we have another interpretation for the total
influence of monotone functions:

Proposition 2.31. If $f : \{-1,1\}^n \to \{-1,1\}$ is monotone, then

$\mathbf{I}[f] = \sum_{i=1}^n \widehat{f}(i).$

This sum of the degree-1 Fourier coefficients has a natural interpretation in
social choice:


Proposition 2.32. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a voting rule for a
2-candidate election. Given votes $x = (x_1, \dots, x_n)$, let $w$ be the number of
votes that agree with the outcome of the election, $f(x)$. Then

$\mathbf{E}[w] = \dfrac{n}{2} + \dfrac{1}{2} \sum_{i=1}^n \widehat{f}(i).$

Proof. By the formula for Fourier coefficients,

$\sum_{i=1}^n \widehat{f}(i) = \sum_{i=1}^n \mathbf{E}_x[f(x)\, x_i] = \mathbf{E}_x[f(x)(x_1 + x_2 + \cdots + x_n)].$   (2.3)

Now $x_1 + \cdots + x_n$ equals the difference between the number of votes for
candidate 1 and the number of votes for candidate $-1$. Hence
$f(x)(x_1 + \cdots + x_n)$ equals the difference between the number of votes for the winner
and the number of votes for the loser; i.e., $w - (n - w) = 2w - n$. The result
follows.
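As a sanity check on Proposition 2.32, the Python sketch below computes both sides of the identity by enumeration for a few small voting rules. The helper names are hypothetical and the code is meant only as an illustration.

```python
from itertools import product


def expected_agreement(f, n):
    """E[w]: the average number of votes x_i that equal the outcome f(x)."""
    total = 0
    for x in product((-1, 1), repeat=n):
        outcome = f(x)
        total += sum(1 for b in x if b == outcome)
    return total / 2 ** n


def degree_one_sum(f, n):
    """Sum of the degree-1 Fourier coefficients, via E[f(x) * (x_1 + ... + x_n)]."""
    return sum(f(x) * sum(x) for x in product((-1, 1), repeat=n)) / 2 ** n


def maj3(x):
    return 1 if sum(x) > 0 else -1


def or3(x):
    # OR with -1 = True: the output is +1 only when every input is +1.
    return 1 if all(b == 1 for b in x) else -1


for rule in (maj3, or3, lambda x: x[0], lambda x: x[0] * x[1]):
    lhs = expected_agreement(rule, 3)
    rhs = 3 / 2 + degree_one_sum(rule, 3) / 2
    assert abs(lhs - rhs) < 1e-12

print(expected_agreement(maj3, 3))  # 2.25
```

Under majority on 3 voters, 2.25 of the 3 votes agree with the outcome on average, compared with 2.0 under a dictatorship.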


Rousseau (Rousseau, 1762) suggested that the ideal voting rule is one which 
maximizes the number of votes that agree with the outcome. Here we show 
that the majority rule has this property (at least when n is odd): 


Theorem 2.33. The unique maximizers of `; fi) among all 
f :{-1, 1)" > {-1,1} are the majority functions. In particular, 


If] < I[Maj,] = /2/2./n + O(n?) for all monotone f. 


Proof. From (2.3),

$$\sum_{i=1}^n \hat{f}(i) = \mathbf{E}_x[f(x)(x_1 + x_2 + \cdots + x_n)] \le \mathbf{E}_x[|x_1 + x_2 + \cdots + x_n|],$$

since $f(x) \in \{-1,1\}$ always. Equality holds if and only if
$f(x) = \mathrm{sgn}(x_1 + \cdots + x_n)$ whenever $x_1 + \cdots + x_n \ne 0$. The
second statement of the theorem follows from Proposition 2.31 and Exercise 2.22.

Let’s now take a look at more analytic expressions for the total influence.
By definition, if $f : \{-1,1\}^n \to \mathbb{R}$, then

$$\mathbf{I}[f] = \sum_{i=1}^n \mathbf{Inf}_i[f] = \sum_{i=1}^n \mathbf{E}_x[D_i f(x)^2] = \mathbf{E}_x\Big[\sum_{i=1}^n D_i f(x)^2\Big]. \tag{2.4}$$


This motivates the following definition: 


Definition 2.34. The (discrete) gradient operator $\nabla$ maps the function
$f : \{-1,1\}^n \to \mathbb{R}$ to the function $\nabla f : \{-1,1\}^n \to \mathbb{R}^n$
defined by

$$\nabla f(x) = (D_1 f(x), D_2 f(x), \ldots, D_n f(x)).$$

Note that for $f : \{-1,1\}^n \to \{-1,1\}$ we have
$\|\nabla f(x)\|_2^2 = \mathrm{sens}_f(x)$, where $\|\cdot\|_2$ is the usual Euclidean
norm in $\mathbb{R}^n$. In general, from (2.4) we deduce:

Proposition 2.35. For $f : \{-1,1\}^n \to \mathbb{R}$,

$$\mathbf{I}[f] = \mathbf{E}_x[\|\nabla f(x)\|_2^2].$$

An alternative analytic definition involves introducing the Laplacian: 


Definition 2.36. The Laplacian operator $L$ is the linear operator on functions
$f : \{-1,1\}^n \to \mathbb{R}$ defined by $L = \sum_{i=1}^n L_i$.


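Exercise 2.17 asks you to verify the following:

Proposition 2.37. For $f : \{-1,1\}^n \to \mathbb{R}$,
  • $Lf(x) = (n/2)\big(f(x) - \mathrm{avg}_{i \in [n]}\{f(x^{\oplus i})\}\big)$,
  • $Lf(x) = f(x) \cdot \mathrm{sens}_f(x)$ if $f : \{-1,1\}^n \to \{-1,1\}$,
  • $Lf = \sum_{S \subseteq [n]} |S|\,\hat{f}(S)\,\chi_S$,
  • $\langle f, Lf \rangle = \mathbf{I}[f]$.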


We can obtain a Fourier formula for the total influence of a function using
Theorem 2.20; when we sum that theorem over all $i \in [n]$ the Fourier weight
$\hat{f}(S)^2$ is counted exactly $|S|$ times. Hence:

Theorem 2.38. For $f : \{-1,1\}^n \to \mathbb{R}$,

$$\mathbf{I}[f] = \sum_{S \subseteq [n]} |S|\,\hat{f}(S)^2 = \sum_{k=0}^n k \cdot \mathbf{W}^{k}[f]. \tag{2.5}$$

For $f : \{-1,1\}^n \to \{-1,1\}$ we can express this using the spectral sample:

$$\mathbf{I}[f] = \mathbf{E}_{S \sim \mathcal{S}_f}[|S|].$$

Thus the total influence of $f : \{-1,1\}^n \to \{-1,1\}$ also measures the
average “height” or degree of its Fourier weights.

Finally, from Proposition 1.13 we have $\mathbf{Var}[f] = \sum_{k > 0} \mathbf{W}^{k}[f]$;
comparing this with (2.5) we immediately deduce a simple but important fact
called the Poincaré Inequality.


Poincaré Inequality. For any $f : \{-1,1\}^n \to \mathbb{R}$, $\mathbf{Var}[f] \le \mathbf{I}[f]$.


Equality holds in the Poincaré Inequality if and only if all of $f$’s Fourier
weight is at degrees 0 and 1; i.e., $\mathbf{W}^{\le 1}[f] = \mathbf{E}[f^2]$. For
Boolean-valued $f : \{-1,1\}^n \to \{-1,1\}$, Exercise 1.19 tells us this can only
occur if $f = \pm 1$ or $f = \pm\chi_i$ for some $i$.
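
The Fourier formula (2.5) and the Poincaré Inequality can be checked by brute force on small examples. The sketch below is only an illustration under stated assumptions: the function names (`fourier`, `total_influence_fourier`, `total_influence_direct`, `variance`) and the sample function `all_plus` are hypothetical, and the direct influence count applies only to Boolean-valued functions.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def cube(n):
    return list(product((-1, 1), repeat=n))


def fourier(f, n):
    """Map each subset S (as a sorted tuple) to the coefficient E[f(x) x^S]."""
    pts = cube(n)
    return {
        S: Fraction(sum(f(x) * prod(x[i] for i in S) for x in pts), len(pts))
        for k in range(n + 1)
        for S in combinations(range(n), k)
    }


def total_influence_fourier(f, n):
    """Right-hand side of (2.5): sum over S of |S| * fhat(S)^2."""
    return sum(len(S) * c * c for S, c in fourier(f, n).items())


def total_influence_direct(f, n):
    """For Boolean-valued f: sum over i of Pr[f(x) != f(x with bit i flipped)]."""
    pts = cube(n)
    count = 0
    for x in pts:
        for i in range(n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            count += f(x) != f(flipped)
    return Fraction(count, len(pts))


def variance(f, n):
    """Var[f] as the Fourier weight on nonempty sets."""
    return sum(c * c for S, c in fourier(f, n).items() if S)


maj3 = lambda x: 1 if sum(x) > 0 else -1
all_plus = lambda x: 1 if all(v == 1 for v in x) else -1  # differs from a constant at one point
dictator = lambda x: x[0]

for g in (maj3, all_plus, dictator):
    print(variance(g, 3), total_influence_fourier(g, 3), total_influence_direct(g, 3))
```

In this small experiment the dictator attains equality in the Poincaré Inequality, while the other two functions satisfy it strictly.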

For Boolean-valued $f : \{-1,1\}^n \to \{-1,1\}$, the Poincaré Inequality can be
viewed as an (edge-)isoperimetric inequality, or (edge-)expansion bound, for
the Hamming cube. If we think of $f$ as the indicator function for a set
$A \subseteq \{-1,1\}^n$ of “measure” $\alpha = |A|/2^n$, then
$\mathbf{Var}[f] = 4\alpha(1 - \alpha)$ (Fact 1.14) whereas $\mathbf{I}[f]$ is $n$ times
the (fractional) size of $A$’s edge boundary. In particular, the Poincaré
Inequality says that subsets $A \subseteq \{-1,1\}^n$ of measure $\alpha = 1/2$ must
have edge boundary at least as large as those of the dictator sets.

For $\alpha \notin \{0, 1/2, 1\}$ the Poincaré Inequality is not sharp as an
edge-isoperimetric inequality for the Hamming cube; for small $\alpha$ even the
asymptotic dependence is not optimal. Precisely optimal edge-isoperimetric
results (and also vertex-isoperimetric results) are known for the Hamming cube.
The following simplified theorem is optimal for $\alpha$ of the form $2^{-i}$:


Theorem 2.39. For $f : \{-1,1\}^n \to \{-1,1\}$ with
$\alpha = \min\{\Pr[f = 1], \Pr[f = -1]\}$,

$$2\alpha \log(1/\alpha) \le \mathbf{I}[f].$$


This result illustrates an important recurring concept in the analysis of
Boolean functions: The Hamming cube is a “small-set expander”. Roughly
speaking, this is the idea that “small” subsets $A \subseteq \{-1,1\}^n$ have
unusually large “boundary size”.


2.4. Noise Stability 


Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is a voting rule for a 2-candidate election.
Making the impartial culture assumption, the $n$ voters independently and
uniformly randomly choose their votes $x = (x_1, \ldots, x_n)$. Now imagine that
when each voter goes to the ballot box there is some chance that their vote is
misrecorded. Specifically, say that each vote is correctly recorded with
probability $\rho \in [0,1]$ and is garbled (i.e., changed to a random bit) with
probability $1 - \rho$. Writing $y = (y_1, \ldots, y_n)$ for the votes that are
finally recorded, we may ask about the probability that $f(x) = f(y)$, i.e.,
whether the misrecorded votes affected the outcome of the election. This has to
do with the noise stability of $f$.


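Definition 2.40. Let $\rho \in [0,1]$. For fixed $x \in \{-1,1\}^n$ we write
$y \sim N_\rho(x)$ to denote that the random string $y$ is drawn as follows: for
each $i \in [n]$ independently,

$$y_i = \begin{cases} x_i & \text{with probability } \rho, \\ \text{uniformly random} & \text{with probability } 1 - \rho. \end{cases}$$

We extend the notation to all $\rho \in [-1,1]$ as follows:

$$y_i = \begin{cases} x_i & \text{with probability } \tfrac12 + \tfrac12\rho, \\ -x_i & \text{with probability } \tfrac12 - \tfrac12\rho. \end{cases}$$

We say that $y$ is $\rho$-correlated to $x$.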


Definition 2.41. If $x \sim \{-1,1\}^n$ is drawn uniformly at random and then
$y \sim N_\rho(x)$, we say that $(x, y)$ is a $\rho$-correlated pair of random
strings. This definition is symmetric in $x$ and $y$; it is equivalent to saying
that independently for each $i \in [n]$, the pair of random bits $(x_i, y_i)$
satisfies $\mathbf{E}[x_i] = \mathbf{E}[y_i] = 0$ and $\mathbf{E}[x_i y_i] = \rho$.


With these definitions in hand we can now define the important concept of 
noise stability, which measures the correlation between f(x) and f(y) when 
$(x, y)$ is a $\rho$-correlated pair.

Definition 2.42. For $f : \{-1,1\}^n \to \mathbb{R}$ and $\rho \in [-1,1]$, the noise
stability of $f$ at $\rho$ is

$$\mathbf{Stab}_\rho[f] = \mathop{\mathbf{E}}_{(x,y)\ \rho\text{-correlated}}[f(x)f(y)].$$

If $f : \{-1,1\}^n \to \{-1,1\}$ we have

$$\mathbf{Stab}_\rho[f] = \Pr_{(x,y)\ \rho\text{-correlated}}[f(x) = f(y)] - \Pr_{(x,y)\ \rho\text{-correlated}}[f(x) \ne f(y)] = 2\Pr_{(x,y)\ \rho\text{-correlated}}[f(x) = f(y)] - 1.$$

In the voting scenario described above, the probability that the misrecording
of votes doesn’t affect the election outcome is
$\frac12 + \frac12\mathbf{Stab}_\rho[f]$.

When $\rho$ is close to 1 (i.e., the “noise” is small) it’s sometimes more
natural to ask about the probability that reversing a small fraction of the
votes reverses the outcome of the election.


Definition 2.43. For $f : \{-1,1\}^n \to \{-1,1\}$ and $\delta \in [0,1]$ we write
$\mathbf{NS}_\delta[f]$ for the noise sensitivity of $f$ at $\delta$, defined to be the
probability that $f(x) \ne f(y)$ when $x \sim \{-1,1\}^n$ is uniformly random and
$y$ is formed from $x$ by reversing each bit independently with probability
$\delta$. In other words,

$$\mathbf{NS}_\delta[f] = \tfrac12 - \tfrac12\mathbf{Stab}_{1-2\delta}[f].$$


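Example 2.44. The constant functions $\pm 1$ have noise stability 1 for every
$\rho$. The dictator functions $\chi_i$ satisfy $\mathbf{Stab}_\rho[\chi_i] = \rho$ for
all $\rho$ (equivalently, $\mathbf{NS}_\delta[\chi_i] = \delta$ for all $\delta$). More
generally,

$$\mathbf{Stab}_\rho[\chi_S] = \mathop{\mathbf{E}}_{(x,y)\ \rho\text{-correlated}}[x^S y^S] = \mathbf{E}\Big[\prod_{i \in S} x_i y_i\Big] = \prod_{i \in S} \mathbf{E}[x_i y_i] = \prod_{i \in S} \rho = \rho^{|S|},$$

where we used the fact that the bit pairs $(x_i, y_i)$ are independent across $i$
to convert the expectation of a product to a product of expectations.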
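
For small n, the noise stability in Definition 2.42 can be computed exactly by summing over all pairs (x, y) with their ρ-correlated probabilities. The following is a minimal sketch under that assumption; the name `noise_stability` is an illustrative choice, and ρ is passed as a `Fraction` so that the identities of Example 2.44 can be checked exactly.

```python
from fractions import Fraction
from itertools import product


def noise_stability(f, n, rho):
    """Stab_rho[f] = E[f(x) f(y)] over rho-correlated pairs, by full enumeration.

    Each bit of y agrees with the matching bit of x with probability (1 + rho)/2,
    which covers every rho in [-1, 1] (Definition 2.40).
    """
    pts = list(product((-1, 1), repeat=n))
    p_agree = (1 + rho) / 2
    p_differ = (1 - rho) / 2
    total = Fraction(0)
    for x in pts:
        for y in pts:
            agree = sum(a == b for a, b in zip(x, y))
            total += p_agree ** agree * p_differ ** (n - agree) * f(x) * f(y)
    return total / len(pts)


rho = Fraction(1, 3)
parity_02 = lambda x: x[0] * x[2]   # chi_S with S = {0, 2} (0-indexed coordinates)
print(noise_stability(parity_02, 3, rho))         # rho^2
print(noise_stability(lambda x: x[1], 3, rho))    # dictator: rho
print(noise_stability(lambda x: -1, 3, rho))      # constant: 1
```

The double loop costs 4^n evaluations, so this is meant purely as a check on small cases.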


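There is no convenient expression for the noise stability of the majority
function, $\mathbf{Stab}_\rho[\mathrm{Maj}_n]$. However, for a fixed noise rate, the
noise stability/sensitivity tends to a nice limit as $n \to \infty$:

Theorem 2.45. For any $\rho \in [-1,1]$,

$$\lim_{\substack{n \to \infty \\ n \text{ odd}}} \mathbf{Stab}_\rho[\mathrm{Maj}_n] = \tfrac{2}{\pi}\arcsin\rho = 1 - \tfrac{2}{\pi}\arccos\rho.$$

Equivalently, for $\delta \in [0,1]$,

$$\lim_{\substack{n \to \infty \\ n \text{ odd}}} \mathbf{NS}_\delta[\mathrm{Maj}_n] = \tfrac{1}{\pi}\arccos(1 - 2\delta).$$

Using $\cos(z) = 1 - \frac12 z^2 + O(z^4)$, hence
$\arccos(1 - 2\delta) = 2\sqrt{\delta} + O(\delta^{3/2})$, we deduce

$$\lim_{\substack{n \to \infty \\ n \text{ odd}}} \mathbf{NS}_\delta[\mathrm{Maj}_n] = \tfrac{2}{\pi}\sqrt{\delta} + O(\delta^{3/2}).$$

We prove Theorem 2.45 in Chapter 5.2. A plot of $\frac{2}{\pi}\arcsin\rho$ appears in
Figure 2.2.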
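
To get a numerical feel for Theorem 2.45, one can compute $\mathbf{Stab}_\rho[\mathrm{Maj}_n]$ for a moderately large odd n and compare it with the limit. The sketch below is one possible approach, not the method used in the text: it tracks the joint distribution of the number of +1 votes in x and in y, one voter at a time. The function name `maj_noise_stability` and the choice n = 51 are assumptions made for illustration, and floating-point arithmetic is used.

```python
from math import asin, pi


def maj_noise_stability(n, rho):
    """Stab_rho[Maj_n] for odd n, via the joint law of (#plus in x, #plus in y)."""
    # Distribution of one rho-correlated bit pair, recorded as (x_i == +1, y_i == +1).
    pair = {
        (1, 1): (1 + rho) / 4,
        (0, 0): (1 + rho) / 4,
        (1, 0): (1 - rho) / 4,
        (0, 1): (1 - rho) / 4,
    }
    dist = {(0, 0): 1.0}
    for _ in range(n):
        new = {}
        for (a, b), p in dist.items():
            for (da, db), q in pair.items():
                key = (a + da, b + db)
                new[key] = new.get(key, 0.0) + p * q
        dist = new
    # Majority outputs +1 exactly when more than half of the votes are +1.
    agree = sum(p for (a, b), p in dist.items() if (2 * a > n) == (2 * b > n))
    return 2 * agree - 1


rho = 0.5
print(maj_noise_stability(51, rho), (2 / pi) * asin(rho))
```

For n = 3 the routine reproduces the exact value $\frac34\rho + \frac14\rho^3$ coming from the Fourier weights of $\mathrm{Maj}_3$.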


There is a simple Fourier formula for the noise stability of a Boolean
function; it’s one of the most powerful links between the combinatorics of
Boolean functions and their Fourier spectra. To determine it, we begin by
introducing the most important operator in analysis of Boolean functions: the
noise operator, denoted $T_\rho$ for historical reasons.

Figure 2.2. Plot of $\frac{2}{\pi}\arcsin\rho$ as a function of $\rho$


Definition 2.46. For $\rho \in [-1,1]$, the noise operator with parameter $\rho$ is
the linear operator $T_\rho$ on functions $f : \{-1,1\}^n \to \mathbb{R}$ defined by

$$T_\rho f(x) = \mathop{\mathbf{E}}_{y \sim N_\rho(x)}[f(y)].$$


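Proposition 2.47. For $f : \{-1,1\}^n \to \mathbb{R}$, the Fourier expansion of
$T_\rho f$ is given by

$$T_\rho f = \sum_{S \subseteq [n]} \rho^{|S|}\hat{f}(S)\,\chi_S = \sum_{k=0}^n \rho^k f^{=k}.$$

Proof. Since $T_\rho$ is a linear operator, it suffices to verify that
$T_\rho \chi_S = \rho^{|S|}\chi_S$:

$$T_\rho \chi_S(x) = \mathop{\mathbf{E}}_{y \sim N_\rho(x)}[y^S] = \prod_{i \in S} \mathop{\mathbf{E}}_{y \sim N_\rho(x)}[y_i] = \prod_{i \in S} \rho x_i = \rho^{|S|}\chi_S(x).$$

Here we used the fact that for $y \sim N_\rho(x)$ the bits $y_i$ are independent and
satisfy $\mathbf{E}[y_i] = \rho x_i$.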
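
Proposition 2.47 can also be confirmed numerically on a small example by computing $T_\rho f$ in two ways: directly from Definition 2.46, and from the Fourier expansion. This is only an illustrative sketch; the names `noise_operator` and `fourier_noise_operator` are hypothetical, and ρ should be passed as a `Fraction` to keep the comparison exact.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def cube(n):
    return list(product((-1, 1), repeat=n))


def noise_operator(f, n, rho):
    """T_rho f from the definition: average f(y) over y ~ N_rho(x)."""
    pts = cube(n)

    def Tf(x):
        total = Fraction(0)
        for y in pts:
            agree = sum(a == b for a, b in zip(x, y))
            weight = ((1 + rho) / 2) ** agree * ((1 - rho) / 2) ** (n - agree)
            total += weight * f(y)
        return total

    return Tf


def fourier_noise_operator(f, n, rho):
    """T_rho f from Proposition 2.47: sum over S of rho^|S| fhat(S) chi_S."""
    pts = cube(n)
    coeffs = {
        S: Fraction(sum(f(x) * prod(x[i] for i in S) for x in pts), len(pts))
        for k in range(n + 1)
        for S in combinations(range(n), k)
    }

    def g(x):
        return sum(rho ** len(S) * c * prod(x[i] for i in S) for S, c in coeffs.items())

    return g


maj3 = lambda x: 1 if sum(x) > 0 else -1
rho = Fraction(1, 2)
direct = noise_operator(maj3, 3, rho)
spectral = fourier_noise_operator(maj3, 3, rho)
print([(direct(x), spectral(x)) for x in cube(3)])
```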


Exercise 2.25 gives an alternate way of looking at this proof. Yet another proof 
using probability densities and convolution is outlined in Exercise 2.30. 
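The connection between $T_\rho$ and noise stability is that

$$\mathbf{Stab}_\rho[f] = \mathop{\mathbf{E}}_{(x,y)\ \rho\text{-correlated}}[f(x)f(y)] = \mathbf{E}_x\Big[f(x)\mathop{\mathbf{E}}_{y \sim N_\rho(x)}[f(y)]\Big];$$

hence:

Fact 2.48. $\mathbf{Stab}_\rho[f] = \langle f, T_\rho f \rangle$.

From Plancherel’s Theorem and Proposition 2.47 we deduce the Fourier
formula for noise stability:

Theorem 2.49. For $f : \{-1,1\}^n \to \mathbb{R}$,

$$\mathbf{Stab}_\rho[f] = \sum_{S \subseteq [n]} \rho^{|S|}\hat{f}(S)^2 = \sum_{k=0}^n \rho^k \cdot \mathbf{W}^{k}[f].$$

Hence for $f : \{-1,1\}^n \to \{-1,1\}$ we have

$$\mathbf{Stab}_\rho[f] = \mathbf{E}_{S \sim \mathcal{S}_f}[\rho^{|S|}], \tag{2.6}$$

$$\mathbf{NS}_\delta[f] = \tfrac12\sum_{k=0}^n \big(1 - (1 - 2\delta)^k\big) \cdot \mathbf{W}^{k}[f]. \tag{2.7}$$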
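
The formulas in Theorem 2.49 reduce noise stability and noise sensitivity to the Fourier weights at each degree. A minimal sketch of that computation follows; the function names (`fourier_weights`, `stab_from_weights`, `ns_from_weights`) are assumptions for illustration, and the weights are found by brute-force enumeration, so n must be small.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def fourier_weights(f, n):
    """Return [W^0[f], W^1[f], ..., W^n[f]] by direct enumeration."""
    pts = list(product((-1, 1), repeat=n))
    weights = [Fraction(0)] * (n + 1)
    for k in range(n + 1):
        for S in combinations(range(n), k):
            c = Fraction(sum(f(x) * prod(x[i] for i in S) for x in pts), len(pts))
            weights[k] += c * c
    return weights


def stab_from_weights(weights, rho):
    """Theorem 2.49: Stab_rho[f] = sum over k of rho^k W^k[f]."""
    return sum(rho ** k * w for k, w in enumerate(weights))


def ns_from_weights(weights, delta):
    """Formula (2.7): NS_delta[f] = (1/2) sum over k of (1 - (1 - 2 delta)^k) W^k[f]."""
    return sum((1 - (1 - 2 * delta) ** k) * w for k, w in enumerate(weights)) / 2


maj3 = lambda x: 1 if sum(x) > 0 else -1
W = fourier_weights(maj3, 3)
print(W, stab_from_weights(W, Fraction(1, 2)), ns_from_weights(W, Fraction(1, 4)))
```

With δ = 1/4 the corresponding correlation is ρ = 1 − 2δ = 1/2, so the two printed quantities are related by Definition 2.43.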


Thus the noise stability of $f$ at $\rho$ is equal to the sum of its Fourier weights,
attenuated by a factor which decreases exponentially with degree. A simple 
but important corollary is that dictators (and their negations) maximize noise 
stability: 


Proposition 2.50. Let $\rho \in (0,1)$. If $f : \{-1,1\}^n \to \{-1,1\}$ is unbiased,
then $\mathbf{Stab}_\rho[f] \le \rho$, with equality if and only if $f = \pm\chi_i$ for
some $i \in [n]$.

Proof. For unbiased $f$ we have $\mathbf{W}^{0}[f] = 0$ and hence
$\mathbf{Stab}_\rho[f] = \sum_{k \ge 1} \rho^k \mathbf{W}^{k}[f]$. Since $\rho^k < \rho$ for all
$k > 1$, noise stability is maximized if all of $f$’s Fourier weight is on degree
1. This occurs if and only if $f = \pm\chi_i$, by Exercise 1.19(a).


For a fixed function $f$, it’s often interesting to see how $\mathbf{Stab}_\rho[f]$
varies as a function of $\rho$. From Theorem 2.49 we see that
$\mathbf{Stab}_\rho[f]$ is a (univariate) polynomial with nonnegative coefficients;
in particular, it’s an increasing function of $\rho$ on $[0,1]$. The derivatives
of this polynomial at 0 and 1 have nice interpretations, as can be immediately
deduced from Theorem 2.49:


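Proposition 2.51. For $f : \{-1,1\}^n \to \mathbb{R}$,

$$\frac{d}{d\rho}\mathbf{Stab}_\rho[f]\,\Big|_{\rho=0} = \mathbf{W}^{1}[f], \qquad \frac{d}{d\rho}\mathbf{Stab}_\rho[f]\,\Big|_{\rho=1} = \mathbf{I}[f].$$

For $f : \{-1,1\}^n \to \{-1,1\}$ we have that $\mathbf{NS}_\delta[f]$ is an increasing
function of $\delta$ on $[0, 1/2]$, and the second identity is equivalent to

$$\frac{d}{d\delta}\mathbf{NS}_\delta[f]\,\Big|_{\delta=0} = \mathbf{I}[f].$$

We conclude this section by introducing a version of influences that also
incorporates noise.

Definition 2.52. For $f : \{-1,1\}^n \to \mathbb{R}$, $\rho \in [0,1]$ and $i \in [n]$,
the $\rho$-stable influence of $i$ on $f$ is

$$\mathbf{Inf}_i^{(\rho)}[f] = \mathbf{Stab}_\rho[D_i f] = \sum_{S \ni i} \rho^{|S|-1}\hat{f}(S)^2,$$

with $0^0$ interpreted as 1. We also define
$\mathbf{I}^{(\rho)}[f] = \sum_{i=1}^n \mathbf{Inf}_i^{(\rho)}[f]$.

Exercise 2.40 asks you to verify the following:

Fact 2.53. $\mathbf{I}^{(\rho)}[f] = \frac{d}{d\rho}\mathbf{Stab}_\rho[f] = \sum_{k=1}^n k\rho^{k-1} \cdot \mathbf{W}^{k}[f]$.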


The $\rho$-stable influence $\mathbf{Inf}_i^{(\rho)}[f]$ increases from $\hat{f}(i)^2$
up to $\mathbf{Inf}_i[f]$ as $\rho$ increases from 0 to 1. For $0 < \rho < 1$ there
isn’t an especially natural combinatorial interpretation for
$\mathbf{Inf}_i^{(\rho)}[f]$ beyond $\mathbf{Stab}_\rho[D_i f]$; however, we will see later
that the stable influences are technically very useful. One reason for this
is that every function $f : \{-1,1\}^n \to \{-1,1\}$ has at most “constantly” many
“stably-influential” coordinates:


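Proposition 2.54. Suppose $f : \{-1,1\}^n \to \mathbb{R}$ has $\mathbf{Var}[f] \le 1$.
Given $0 < \delta, \epsilon \le 1$, let
$J = \{i \in [n] : \mathbf{Inf}_i^{(1-\delta)}[f] \ge \epsilon\}$. Then
$|J| \le \frac{1}{\delta\epsilon}$.

Proof. Certainly $|J| \le \mathbf{I}^{(1-\delta)}[f]/\epsilon$, so it remains to verify
$\mathbf{I}^{(1-\delta)}[f] \le 1/\delta$. Comparing Fact 2.53 with
$\mathbf{Var}[f] = \sum_{k \ne 0} \mathbf{W}^{k}[f]$ term by term, it suffices to show that
$(1 - \delta)^{k-1}k \le 1/\delta$ for all $k \ge 1$. This is the easy Exercise 2.45.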


It’s good to think of the set $J$ in this proposition as the “notable” coordinates
for function $f$. Had we used the usual influences in place of stable influences,
we would not have been guaranteed a bounded number of “notable” coordinates
(since, e.g., the parity function $\chi_{[n]}$ has all $n$ of its influences equal to 1).


2.5. Highlight: Arrow’s Theorem 


When there are just 2 candidates, the majority function possesses all of the 
mathematical properties that seem desirable in a voting rule (e.g., May’s The- 
orem and Theorem 2.33). Unfortunately, as soon as there are 3 (or more) 
candidates the problem of social choice becomes much more difficult. For 
example, suppose we have candidates a, b, and c, and each of n voters has 
a ranking of them. How should we aggregate these preferences to produce a 
winning candidate?

In his 1785 Essay on the Application of Analysis to the Probability of
Majority Decisions (de Condorcet, 1785), Condorcet suggested using the
voters’ preferences to conduct the three possible pairwise elections, a vs. b,
b vs. c, and c vs. a. This calls for the use of a 2-candidate voting rule
$f : \{-1,1\}^n \to \{-1,1\}$; Condorcet suggested $f = \mathrm{Maj}_n$ but we might
consider any such rule. Thus a “3-candidate Condorcet election” using $f$ is
conducted as follows:


                             Voters’ Preferences
                            #1    #2    #3    ···         Societal Aggregation
    a (+1) vs. b (-1)   |   +1    +1    -1    ···   = x          f(x)
    b (+1) vs. c (-1)   |   +1    -1    +1    ···   = y          f(y)
    c (+1) vs. a (-1)   |   -1    -1    +1    ···   = z          f(z)


In the above example, voter #1 ranked the candidates a > b > c, voter #2
ranked them a > c > b, voter #3 ranked them b > c > a, etc. Note that the $i$th
voter has one of 3! = 6 possible rankings, and these translate into a triple of
bits $(x_i, y_i, z_i)$ from the following set:

$$\{(+1,+1,-1),\ (+1,-1,-1),\ (-1,+1,-1),\ (-1,+1,+1),\ (+1,-1,+1),\ (-1,-1,+1)\}.$$

These are precisely the triples satisfying the not-all-equal predicate
$\mathrm{NAE}_3$ (see Exercise 1.1(i)).

In the example above, if $n = 3$ and $f = \mathrm{Maj}_3$ then the societal outcome
would be $(+1, +1, -1)$, meaning that society elects a over b, b over c, and
a over c. In this case it is only natural to declare a the overall winner.


Definition 2.55. In an election employing Condorcet’s method with
$f : \{-1,1\}^n \to \{-1,1\}$, we say that a candidate is a Condorcet winner if it
wins all of the pairwise elections in which it participates.


Unfortunately, as Condorcet himself noted, there may not be a Condorcet
winner. In the example above, if voter #2’s ranking was instead c > a > b
(corresponding to $(+1, -1, +1)$), we would obtain the “paradoxical” outcome
$(+1, +1, +1)$: society prefers a over b, b over c, and c over a!
This lack of a Condorcet winner is termed Condorcet’s Paradox; it occurs
when the outcome $(f(x), f(y), f(z))$ is one of the two “all-equal” triples
$\{(-1,-1,-1),\ (+1,+1,+1)\}$.

One might wonder if the Condorcet Paradox can be avoided by using a
voting rule $f : \{-1,1\}^n \to \{-1,1\}$ other than majority. However, in 1950
Arrow (Arrow, 1950) famously showed that the only means of avoidance is an
unappealing one:


Arrow’s Theorem. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is a unanimous voting
rule used in a 3-candidate Condorcet election. If there is always a Condorcet
winner, then $f$ must be a dictatorship.


(In fact, Arrow’s Theorem is slightly stronger than this; see Exercise 2.51.) 

In 2002 Kalai gave a new proof of Arrow’s Theorem; it takes its cue from the 
title of Condorcet’s work and computes the probability of a Condorcet winner. 
This is done under the “impartial culture assumption” for 3-candidate elections: 
each voter independently chooses one of the 6 possible rankings uniformly at 
random. 


Theorem 2.56. Consider a 3-candidate Condorcet election using
$f : \{-1,1\}^n \to \{-1,1\}$. Under the impartial culture assumption, the
probability of a Condorcet winner is precisely
$\frac34 - \frac34\mathbf{Stab}_{-1/3}[f]$.


Proof. Let $x, y, z \in \{-1,1\}^n$ be the votes for the elections a vs. b, b vs. c,
and c vs. a, respectively. Under impartial culture, the bit triples
$(x_i, y_i, z_i)$ are independent and each is drawn uniformly from the 6 triples
satisfying the not-all-equal predicate $\mathrm{NAE}_3 : \{-1,1\}^3 \to \{0,1\}$. There
is a Condorcet winner if and only if $\mathrm{NAE}_3(f(x), f(y), f(z)) = 1$. Hence

$$\Pr[\exists\ \text{Condorcet winner}] = \mathbf{E}[\mathrm{NAE}_3(f(x), f(y), f(z))]. \tag{2.8}$$

The multilinear (Fourier) expansion of $\mathrm{NAE}_3$ is

$$\mathrm{NAE}_3(w_1, w_2, w_3) = \tfrac34 - \tfrac14 w_1 w_2 - \tfrac14 w_1 w_3 - \tfrac14 w_2 w_3;$$

thus

$$(2.8) = \tfrac34 - \tfrac14\mathbf{E}[f(x)f(y)] - \tfrac14\mathbf{E}[f(x)f(z)] - \tfrac14\mathbf{E}[f(y)f(z)].$$

In the joint distribution of $x, y$ the $n$ bit pairs $(x_i, y_i)$ are independent.
Further, by inspection we see that $\mathbf{E}[x_i] = \mathbf{E}[y_i] = 0$ and that
$\mathbf{E}[x_i y_i] = (2/6)(+1) + (4/6)(-1) = -1/3$. Hence $\mathbf{E}[f(x)f(y)]$ is
precisely $\mathbf{Stab}_{-1/3}[f]$. Similarly
$\mathbf{E}[f(x)f(z)] = \mathbf{E}[f(y)f(z)] = \mathbf{Stab}_{-1/3}[f]$ and the proof is
complete.
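
Theorem 2.56 lends itself to a brute-force check for a handful of voters: enumerate all 6^n ranking profiles, count those with a Condorcet winner, and compare with $\frac34 - \frac34\mathbf{Stab}_{-1/3}[f]$. The sketch below does this under stated assumptions; the names `RANKINGS`, `condorcet_probability`, and `noise_stability` are illustrative, and the two sample rules (a dictator and the 3-bit parity) are chosen only as test cases.

```python
from fractions import Fraction
from itertools import product

# The 6 rankings of {a, b, c}, encoded as not-all-equal triples (x_i, y_i, z_i).
RANKINGS = [t for t in product((-1, 1), repeat=3) if len(set(t)) == 2]


def condorcet_probability(f, n):
    """Fraction of the 6^n ranking profiles for which a Condorcet winner exists."""
    profiles = list(product(RANKINGS, repeat=n))
    good = 0
    for profile in profiles:
        x, y, z = (tuple(col) for col in zip(*profile))
        # A Condorcet winner exists iff the three pairwise outcomes are not all equal.
        if len({f(x), f(y), f(z)}) == 2:
            good += 1
    return Fraction(good, len(profiles))


def noise_stability(f, n, rho):
    """Stab_rho[f] by enumerating all pairs (x, y) with their rho-correlated weights."""
    pts = list(product((-1, 1), repeat=n))
    total = Fraction(0)
    for x in pts:
        for y in pts:
            agree = sum(a == b for a, b in zip(x, y))
            total += ((1 + rho) / 2) ** agree * ((1 - rho) / 2) ** (n - agree) * f(x) * f(y)
    return total / len(pts)


dictator = lambda x: x[0]
parity3 = lambda x: x[0] * x[1] * x[2]
for g in (dictator, parity3):
    formula = Fraction(3, 4) - Fraction(3, 4) * noise_stability(g, 3, Fraction(-1, 3))
    print(condorcet_probability(g, 3), formula)
```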


Arrow’s Theorem is now an easy corollary:

Proof of Arrow’s Theorem. By assumption, the probability of a Condorcet
winner is 1; hence

$$1 = \tfrac34 - \tfrac34\mathbf{Stab}_{-1/3}[f] = \tfrac34 - \tfrac34\sum_{k=0}^n (-1/3)^k\,\mathbf{W}^{k}[f].$$

Since $(-1/3)^k \ge -1/3$ for all $k$, the equality above can only occur if all of
$f$’s Fourier weight is on degree 1; i.e., $\mathbf{W}^{1}[f] = 1$. By Exercise 1.19(a)
this implies that $f$ is either a dictator or a negated-dictator. Since $f$ is
unanimous, it must in fact be a dictator.


An advantage of Kalai’s analytic proof of Arrow’s Theorem is that we can 
deduce several more interesting results about the probability of a Condorcet 
winner. For example, combining Theorem 2.56 with Theorem 2.45 we get 
Guilbaud’s Formula: 


Guilbaud’s Formula. In a 3-candidate Condorcet election using $\mathrm{Maj}_n$, the
probability of a Condorcet winner tends to

$$\frac{3}{2\pi}\arccos(-1/3) \approx 91.2\%$$

as $n \to \infty$.
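
The numerical value in Guilbaud’s Formula can be reproduced by substituting the limit from Theorem 2.45 (at ρ = −1/3) into Theorem 2.56. The short sketch below is just that arithmetic in floating point; the variable names are illustrative.

```python
from math import acos, pi

rho = -1 / 3
limit_stab = 1 - (2 / pi) * acos(rho)      # Theorem 2.45: limit of Stab_rho[Maj_n]
limit_prob = 0.75 - 0.75 * limit_stab      # Theorem 2.56 applied to that limit
guilbaud = 3 / (2 * pi) * acos(-1 / 3)     # closed form stated in Guilbaud's Formula

print(limit_prob, guilbaud)
```

The two expressions agree because $\frac34 - \frac34\big(1 - \frac{2}{\pi}\arccos\rho\big) = \frac{3}{2\pi}\arccos\rho$.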


This is already a fairly high probability. Unfortunately, if we want to improve 
on it while still using a reasonably fair election scheme, we can only set our 
hopes higher by a sliver: 


Theorem 2.57. In a 3-candidate Condorcet election using an
$f : \{-1,1\}^n \to \{-1,1\}$ with all $\hat{f}(i)$ equal, the probability of a
Condorcet winner is at most $\frac79 + \frac{4}{9\pi} + o_n(1) \approx 91.9\%$.


The condition in Theorem 2.57 seems like it would be satisfied by most
reasonably fair voting rules $f : \{-1,1\}^n \to \{-1,1\}$ (e.g., it is satisfied if
$f$ is transitive-symmetric or is monotone with all influences equal). In fact,
we will show that Theorem 2.57’s hypothesis can be relaxed in Chapter 5.4; we
will further show in Chapter 11.7 that $\frac79 + \frac{4}{9\pi}$ can be improved to
the tight value $\frac{3}{2\pi}\arccos(-1/3)$ of majority. To return to Theorem 2.57,
it is an immediate consequence of the following two results, the first being
Exercise 2.24 and the second being an easy corollary of Theorem 2.56.


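Proposition 2.58. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ has all $\hat{f}(i)$ equal.
Then $\mathbf{W}^{1}[f] \le 2/\pi + o_n(1)$.

Corollary 2.59. In a 3-candidate Condorcet election using
$f : \{-1,1\}^n \to \{-1,1\}$, the probability of a Condorcet winner is at most
$\frac79 + \frac29\mathbf{W}^{1}[f]$.

Proof. From Theorem 2.56, the probability is

$$\begin{aligned}
\tfrac34 - \tfrac34\mathbf{Stab}_{-1/3}[f]
 &= \tfrac34 - \tfrac34\big(\mathbf{W}^{0}[f] - \tfrac13\mathbf{W}^{1}[f] + \tfrac19\mathbf{W}^{2}[f] - \tfrac{1}{27}\mathbf{W}^{3}[f] + \cdots\big) \\
 &\le \tfrac34 + \tfrac14\mathbf{W}^{1}[f] + \tfrac{1}{36}\mathbf{W}^{3}[f] + \tfrac{1}{324}\mathbf{W}^{5}[f] + \cdots \\
 &\le \tfrac34 + \tfrac14\mathbf{W}^{1}[f] + \tfrac{1}{36}\big(\mathbf{W}^{3}[f] + \mathbf{W}^{5}[f] + \cdots\big) \\
 &\le \tfrac34 + \tfrac14\mathbf{W}^{1}[f] + \tfrac{1}{36}\big(1 - \mathbf{W}^{1}[f]\big) = \tfrac79 + \tfrac29\mathbf{W}^{1}[f].
\end{aligned}$$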


Finally, using Corollary 2.59 we can prove a “robust” version of Arrow’s 
Theorem, showing that a Condorcet election is almost paradox-free only if it 
is almost a dictatorship (possibly negated). 


Corollary 2.60. Suppose that in a 3-candidate Condorcet election using
$f : \{-1,1\}^n \to \{-1,1\}$, the probability of a Condorcet winner is
$1 - \epsilon$. Then $f$ is $O(\epsilon)$-close to $\pm\chi_i$ for some $i \in [n]$.

Proof. From Corollary 2.59 we obtain that $\mathbf{W}^{1}[f] \ge 1 - \frac92\epsilon$. The
conclusion now follows from the FKN Theorem.


Friedgut-Kalai-Naor (FKN) Theorem. Suppose $f : \{-1,1\}^n \to \{-1,1\}$
has $\mathbf{W}^{1}[f] \ge 1 - \delta$. Then $f$ is $O(\delta)$-close to $\pm\chi_i$ for some $i \in [n]$.


We will see the proof of the FKN Theorem in Chapter 9.1. We’ll also show in
Chapter 5.4 that the $O(\delta)$ closeness can be improved to
$\delta/4 + O(\delta^2\log(2/\delta))$.


2.1 For each function in Exercise 1.1, determine if it is odd,
    transitive-symmetric, and/or symmetric.

2.2 Show that the $n$-bit functions majority, AND, OR, $\pm\chi_i$, and $\pm 1$ are
    all linear threshold functions.


2.3 Prove May’s Theorem:
    (a) Show that $f : \{-1,1\}^n \to \{-1,1\}$ is symmetric and monotone if
        and only if it can be expressed as a weighted majority with
        $a_1 = a_2 = \cdots = a_n = 1$.
    (b) Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is symmetric, monotone, and odd.
        Show that $n$ must be odd, and that $f = \mathrm{Maj}_n$.

2.4 Subset $A \subseteq \{-1,1\}^n$ is called a Hamming ball if
    $A = \{x : \Delta(x, z) \le r\}$ for some $z \in \{-1,1\}^n$ and real $r$. Show
    that $f : \{-1,1\}^n \to \{-1,1\}$ is the indicator of a Hamming ball if and
    only if it’s expressible as a linear threshold function
    $f(x) = \mathrm{sgn}(a_0 + a_1 x_1 + \cdots + a_n x_n)$ with
    $|a_1| = |a_2| = \cdots = |a_n|$.

2.5 Let $f : \{-1,1\}^n \to \{-1,1\}$ and $i \in [n]$. We say that $f$ is unate in
    the $i$th direction if either $f(x^{(i \mapsto -1)}) \le f(x^{(i \mapsto 1)})$ for all
    $x$ (monotone in the $i$th direction) or
    $f(x^{(i \mapsto -1)}) \ge f(x^{(i \mapsto 1)})$ for all $x$ (antimonotone in the
    $i$th direction). We say that $f$ is unate if it is unate in all $n$
    directions.
    (a) Show that $|\hat{f}(i)| \le \mathbf{Inf}_i[f]$ with equality if and only if $f$
        is unate in the $i$th direction.
    (b) Show that the second statement of Theorem 2.33 holds even for all
        unate $f$.

2.6 Show that linear threshold functions are unate.

2.7 For each function $f$ in Exercise 1.1, compute $\mathbf{Inf}_i[f]$.

2.8 Let $f : \{-1,1\}^n \to \{-1,1\}$. Show that $\mathbf{Inf}_i[f] \le \mathbf{Var}[f]$ for
    each $i \in [n]$. (Hint: Show
    $\mathbf{Inf}_i[f] \le 2\min\{\Pr[f = -1], \Pr[f = 1]\}$.)

2.9 Let $f : \{0,1\}^6 \to \{-1,1\}$ be given by the weighted majority
    $f(x) = \mathrm{sgn}(-58 + 31x_1 + 31x_2 + 28x_3 + 21x_4 + 2x_5 + 2x_6)$. Compute
    $\mathbf{Inf}_i[f]$ for all $i \in [6]$.

2.10 Say that coordinate $i$ is $b$-pivotal for $f : \{-1,1\}^n \to \{-1,1\}$ on
     input $x$ (for $b \in \{-1,1\}$) if $f(x) = b$ and $f(x^{\oplus i}) \ne b$. Show
     that $\Pr_x[i \text{ is } b\text{-pivotal on } x] = \frac12\mathbf{Inf}_i[f]$. Deduce that
     $\mathbf{I}[f] = 2\,\mathbf{E}_x[\#\, b\text{-pivotal coordinates on } x]$.

2.11 Let $f : \{-1,1\}^n \to \{-1,1\}$ and suppose $\hat{f}(S) \ne 0$. Show that each
     coordinate $i \in S$ is relevant for $f$.

2.12 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a random function (as in Exercise 1.7).
     Compute $\mathbf{E}[\mathbf{Inf}_i[f]]$ and $\mathbf{E}[\mathbf{I}[f]]$.

2.13 Let $w \in \mathbb{N}$, $n = w2^w$, and write $f$ for
     $\mathrm{Tribes}_{w,2^w} : \{-1,1\}^n \to \{-1,1\}$.
     (a) Compute $\mathbf{E}[f]$ and $\mathbf{Var}[f]$, and estimate them asymptotically
         in terms of $n$.
     (b) Describe the function $D_i f$.
     (c) Compute $\mathbf{Inf}_i[f]$ and $\mathbf{I}[f]$ and estimate them asymptotically.

2.14 Let $f : \{-1,1\}^n \to \mathbb{R}$. Show that $|D_i|f|| \le |D_i f|$ pointwise.
     Deduce that $\mathbf{Inf}_i[|f|] \le \mathbf{Inf}_i[f]$ and $\mathbf{I}[|f|] \le \mathbf{I}[f]$.

2.15 Prove Proposition 2.24.
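2.16 Prove Proposition 2.26.

2.17 Prove Proposition 2.37.

2.18 Let $f : \{-1,1\}^n \to \mathbb{R}$. Show that

     $$Lf = \frac{d}{d\rho}T_\rho f\,\Big|_{\rho=1} = -\frac{d}{dt}T_{e^{-t}} f\,\Big|_{t=0}.$$

2.19 Suppose $f, g : \{-1,1\}^n \to \mathbb{R}$ have the property that $f$ does not
     depend on the $i$th coordinate and $g$ does not depend on the $j$th
     coordinate ($i \ne j$). Show that
     $\mathbf{E}[x_i x_j f(x)g(x)] = \mathbf{E}[D_j f(x)D_i g(x)]$.

2.20 For $f : \{-1,1\}^n \to \{-1,1\}$ we have that
     $\mathbf{E}[\mathrm{sens}_f(x)] = \mathbf{E}_{S \sim \mathcal{S}_f}[|S|]$. Show that also
     $\mathbf{E}[\mathrm{sens}_f(x)^2] = \mathbf{E}[|S|^2]$. (Hint: Use Proposition 2.37.)
     Is it true that $\mathbf{E}[\mathrm{sens}_f(x)^3] = \mathbf{E}[|S|^3]$?

2.21 Let $f : \{-1,1\}^n \to \mathbb{R}$ and $i \in [n]$.
     (a) Define $\mathrm{Var}_i f : \{-1,1\}^n \to \mathbb{R}$ by
         $\mathrm{Var}_i f(x) = \mathbf{Var}_{x_i'}[f(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n)]$.
         Show that $\mathbf{Inf}_i[f] = \mathbf{E}_x[\mathrm{Var}_i f(x)]$.
     (b) Show that

         $$\mathbf{Inf}_i[f] = \tfrac12\mathop{\mathbf{E}}_{\substack{x_i, x_i' \sim \{-1,1\} \\ \text{independent}}}\big[\|f_{|x_i} - f_{|x_i'}\|_2^2\big],$$

         where $f_{|b}$ denotes the function of $n - 1$ variables gotten by fixing
         the $i$th input of $f$ to bit $b$.

2.22 (a) Show that $\mathbf{Inf}_i[\mathrm{Maj}_n] = \binom{n-1}{\frac{n-1}{2}}2^{1-n}$ for all
         $i \in [n]$.
     (b) Show that $\mathbf{Inf}_1[\mathrm{Maj}_n]$ is a decreasing function of (odd) $n$.
     (c) Use Stirling’s Formula $m! = (m/e)^m(\sqrt{2\pi m} + O(m^{-1/2}))$ to
         deduce that
         $\mathbf{Inf}_1[\mathrm{Maj}_n] = \frac{\sqrt{2/\pi}}{\sqrt{n}} + O(n^{-3/2})$.
     (d) Deduce that $2/\pi \le \mathbf{W}^{1}[\mathrm{Maj}_n] \le 2/\pi + O(n^{-1})$.
     (e) Deduce that
         $\sqrt{2/\pi}\,\sqrt{n} \le \mathbf{I}[\mathrm{Maj}_n] \le \sqrt{2/\pi}\,\sqrt{n} + O(n^{-1/2})$.
     (f) Suppose $n$ is even and $f : \{-1,1\}^n \to \{-1,1\}$ is a majority
         function. Show that
         $\mathbf{I}[f] = \mathbf{I}[\mathrm{Maj}_{n-1}] = \sqrt{2/\pi}\,\sqrt{n} + O(n^{-1/2})$.

2.23 Using only Cauchy-Schwarz and Parseval, give a very simple proof of
     the following weakening of Theorem 2.33: If $f : \{-1,1\}^n \to \{-1,1\}$
     is monotone then $\mathbf{I}[f] \le \sqrt{n}$. Extend also to the case of $f$ unate
     (see Exercise 2.5).

2.24 Prove Proposition 2.58 with $O(n^{-1})$ in place of $o_n(1)$. (Hint: Show
     $\hat{f}(i) \le \frac{\sqrt{2/\pi}}{\sqrt{n}} + O(n^{-3/2})$ using Theorem 2.33.)

2.25 Deduce $T_\rho f(x) = \sum_S \rho^{|S|}\hat{f}(S)\,x^S$ using Exercise 1.4.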
2.26 For each function $f$ in Exercise 1.1, compute $\mathbf{I}[f]$.

2.27 Which functions $f : \{-1,1\}^n \to \{-1,1\}$ with $\#\{x : f(x) = 1\} = 3$
     maximize $\mathbf{I}[f]$?

2.28 Suppose $f : \{-1,1\}^n \to \mathbb{R}$ is an even function (recall Exercise
     1.8). Show the improved Poincaré Inequality
     $\mathbf{Var}[f] \le \frac12\mathbf{I}[f]$.

2.29 Let $f : \{-1,1\}^n \to \{-1,1\}$ be unbiased, $\mathbf{E}[f] = 0$, and let
     $\mathbf{MaxInf}[f]$ denote $\max_{i \in [n]}\{\mathbf{Inf}_i[f]\}$.
     (a) Use the Poincaré Inequality to show $\mathbf{MaxInf}[f] \ge 1/n$.
     (b) Prove that $\mathbf{I}[f] \ge 2 - n\,\mathbf{MaxInf}[f]^2$. (Hint: Prove
         $\mathbf{I}[f] \ge \mathbf{W}^{1}[f] + 2(1 - \mathbf{W}^{1}[f])$ and use Exercise 2.5.)
         Deduce that $\mathbf{MaxInf}[f] \ge$

2.30 Use Exercise 1.1 to deduce the formulas
     $\mathrm{E}_i f = \sum_{S \not\ni i}\hat{f}(S)\,\chi_S$ and
     $T_\rho f = \sum_S \rho^{|S|}\hat{f}(S)\,\chi_S$.

2.31 Show that $T_\rho$ is positivity-preserving for $\rho \in [-1,1]$; i.e.,
     $f \ge 0 \Rightarrow T_\rho f \ge 0$. Show that $T_\rho$ is positivity-improving
     for $\rho \in (-1,1)$; i.e., $f \ge 0,\ f \ne 0 \Rightarrow T_\rho f > 0$.

2.32 Show that $T_\rho$ satisfies the semigroup property:
     $T_{\rho_1}T_{\rho_2} = T_{\rho_1\rho_2}$.

2.33 For $\rho \in [-1,1]$, show that $T_\rho$ is a contraction on
     $L^p(\{-1,1\}^n)$ for all $p \ge 1$; i.e., $\|T_\rho f\|_p \le \|f\|_p$ for all
     $f : \{-1,1\}^n \to \mathbb{R}$.

2.34 Show that $|T_\rho f| \le T_\rho|f|$ pointwise for any
     $f : \{-1,1\}^n \to \mathbb{R}$. Further show that for $-1 < \rho < 1$, equality
     occurs if and only if $f$ is everywhere nonnegative or everywhere
     nonpositive.

2.35 For $i \in [n]$ and $\rho \in \mathbb{R}$, let $T^i_\rho$ be the operator on functions
     $f : \{-1,1\}^n \to \mathbb{R}$ defined by

     $$T^i_\rho f = \rho f + (1 - \rho)\mathrm{E}_i f = \mathrm{E}_i f + \rho L_i f.$$

     (a) Show that for $\rho \in [-1,1]$ we have

         $$T^i_\rho f(x) = \mathop{\mathbf{E}}_{y_i \sim N_\rho(x_i)}[f(x_1, \ldots, x_{i-1}, y_i, x_{i+1}, \ldots, x_n)].$$

     (b) Show that $T^i_{\rho_1}T^i_{\rho_2} = T^i_{\rho_1\rho_2}$ (cf. Exercise 2.32) and
         that any two operators $T^i_\rho$ and $T^j_{\rho'}$ commute.
     (c) For $(\rho_1, \ldots, \rho_n) \in \mathbb{R}^n$ we define
         $T_{(\rho_1,\ldots,\rho_n)} = T^1_{\rho_1}T^2_{\rho_2}\cdots T^n_{\rho_n}$. Show that
         $T_{(\rho,\ldots,\rho)}$ is simply $T_\rho$ and that
         $T_{(1,\ldots,1,\rho,1,\ldots,1)}$ (with the $\rho$ in the $i$th position) is
         $T^i_\rho$.
     (d) For $\rho_1, \ldots, \rho_n \in [-1,1]$, show that $T_{(\rho_1,\ldots,\rho_n)}$ is
         a contraction on $L^p(\{-1,1\}^n)$ for all $p \ge 1$ (cf. Exercise 2.33).
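2.36 Show that $\mathbf{Stab}_{-\rho}[f] = -\mathbf{Stab}_\rho[f]$ if $f$ is odd and
     $\mathbf{Stab}_{-\rho}[f] = \mathbf{Stab}_\rho[f]$ if $f$ is even.

2.37 For each function $f$ in Exercise 1.1, compute $\mathbf{Stab}_\rho[f]$.

2.38 Compute $\mathbf{Stab}_\rho[\mathrm{Tribes}_{w,s}]$.

2.39 Suppose $f : \{-1,1\}^n \to \{-1,1\}$ has
     $\min(\Pr[f = 1], \Pr[f = -1]) = \alpha$. Show that
     $\mathbf{NS}_\delta[f] \le 2\alpha$ for all $\delta \in [0,1]$.

2.40 Verify Fact 2.53.

2.41 Fix $f : \{-1,1\}^n \to \mathbb{R}$. Show that $\mathbf{Stab}_\rho[f]$ is a convex
     function of $\rho$ on $[0,1]$.

2.42 Let $f : \{-1,1\}^n \to \{-1,1\}$. Show that
     $\mathbf{NS}_\delta[f] \le \delta\,\mathbf{I}[f]$ for all $\delta \in [0,1]$.

2.43 (a) Define the average influence of $f : \{-1,1\}^n \to \mathbb{R}$ to be
         $\mathcal{E}[f] = \frac{1}{n}\mathbf{I}[f]$. Now for $f : \{-1,1\}^n \to \{-1,1\}$,
         show

         $$\mathcal{E}[f] = \Pr_{\substack{x \sim \{-1,1\}^n \\ i \sim [n]}}[f(x) \ne f(x^{\oplus i})]$$

         and

         $$\tfrac{1 - e^{-2}}{2}\,\mathcal{E}[f] \le \mathbf{NS}_{1/n}[f] \le \mathcal{E}[f].$$

     (b) Given $f : \{-1,1\}^n \to \{-1,1\}$ and integer $k \ge 2$, define

         $$A_k = \tfrac{1}{k}\big(\mathbf{W}^{\ge 1}[f] + \mathbf{W}^{\ge 2}[f] + \cdots + \mathbf{W}^{\ge k}[f]\big),$$

         the “average of the first $k$ tail weights”. Generalizing the second
         statement in part (a), show that
         $\frac{1 - e^{-2}}{2}A_k \le \mathbf{NS}_{1/k}[f] \le A_k$.

2.44 Suppose $f_1, \ldots, f_s : \{-1,1\}^n \to \{-1,1\}$ satisfy
     $\mathbf{NS}_\delta[f_j] \le \epsilon_j$. Let $g : \{-1,1\}^s \to \{-1,1\}$ and define
     $h : \{-1,1\}^n \to \{-1,1\}$ by $h = g(f_1, \ldots, f_s)$. Show that
     $\mathbf{NS}_\delta[h] \le \sum_j \epsilon_j$.

2.45 Complete the proof of Proposition 2.54 by showing that
     $(1 - \delta)^{k-1}k \le 1/\delta$ for all $0 < \delta \le 1$ and
     $k \in \mathbb{N}^+$. (Hint: Compare both sides with
     $1 + (1 - \delta) + (1 - \delta)^2 + \cdots + (1 - \delta)^{k-1}$.)

2.46 Fixing $f : \{-1,1\}^n \to \mathbb{R}$, show the following Lipschitz bound for
     $\mathbf{Stab}_\rho[f]$ when $0 \le \rho - \epsilon \le \rho < 1$:

     $$|\mathbf{Stab}_\rho[f] - \mathbf{Stab}_{\rho-\epsilon}[f]| \le \epsilon \cdot \frac{1}{1 - \rho} \cdot \mathbf{Var}[f].$$

     (Hint: Use the Mean Value Theorem and Exercise 2.45.)

2.47 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a transitive-symmetric function; in the
     notation of Exercise 1.30, this means the group $\mathrm{Aut}(f)$ acts
     transitively on $[n]$. Show that
     $\Pr_{\pi \sim \mathrm{Aut}(f)}[\pi(i) = j] = 1/n$ for all $i, j \in [n]$.

2.48 Suppose that $F$ is a functional on functions $f : \{-1,1\}^n \to \mathbb{R}$
     expressible as $F[f] = \sum_S c_S\hat{f}(S)^2$ where $c_S \ge 0$ for all
     $S \subseteq [n]$. (Examples include $\mathbf{Var}$, $\mathbf{W}^{k}$, $\mathbf{Inf}_i$,
     $\mathbf{I}$, $\mathbf{Inf}_i^{(\rho)}$, and $\mathbf{Stab}_\rho$ for $\rho \ge 0$.) Show that $F$
     is convex, meaning
     $F[\lambda f + (1 - \lambda)g] \le \lambda F[f] + (1 - \lambda)F[g]$ for all
     $f$, $g$, and $\lambda \in [0,1]$.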

2.49 Extend the FKN Theorem as follows: Suppose $f : \{-1,1\}^n \to \{-1,1\}$
     has $\mathbf{W}^{\le 1}[f] \ge 1 - \delta$. Show that $f$ is $O(\delta)$-close to a
     1-junta. (Hint: Consider $g(x_0, x) = x_0 f(x_0 x)$.)

2.50 Compute the precise probability of a Condorcet winner (under impartial
     culture) in a 3-candidate, 3-voter election using $f = \mathrm{Maj}_3$.

2.51 (a) Arrow’s Theorem for 3 candidates is slightly more general than what
         we stated: it allows for three different unanimous functions
         $f, g, h : \{-1,1\}^n \to \{-1,1\}$ to be used in the three pairwise
         elections. But show that if using $f, g, h$ always gives rise to a
         Condorcet winner then $f = g = h$. (Hint: First show
         $g(x) = -f(-x)$ for all $x$ by using the fact that $x$, $y = -x$, and
         $z = (f(x), \ldots, f(x))$ is always a valid possibility for the votes.)
     (b) Extend Arrow’s Theorem to the case of Condorcet elections with
         more than 3 candidates.

2.52 The polarizations of $f : \{-1,1\}^n \to \mathbb{R}$ (also known as
     compressions, downshifts, or two-point rearrangements) are defined as
     follows. For $i \in [n]$, the $i$-polarization of $f$ is the function
     $f^{\sigma_i} : \{-1,1\}^n \to \mathbb{R}$ defined by

     $$f^{\sigma_i}(x) = \begin{cases} \max\{f(x^{(i \mapsto +1)}), f(x^{(i \mapsto -1)})\} & \text{if } x_i = +1, \\ \min\{f(x^{(i \mapsto +1)}), f(x^{(i \mapsto -1)})\} & \text{if } x_i = -1. \end{cases}$$

     (a) Show that $\mathbf{E}[f^{\sigma_i}] = \mathbf{E}[f]$ and
         $\|f^{\sigma_i}\|_p = \|f\|_p$ for all $p$.
     (b) Show that $\mathbf{Inf}_j[f^{\sigma_i}] \le \mathbf{Inf}_j[f]$ for all $j \in [n]$.
     (c) Show that $\mathbf{Stab}_\rho[f^{\sigma_i}] \ge \mathbf{Stab}_\rho[f]$ for all
         $0 \le \rho \le 1$.
     (d) Show that $f^{\sigma_i}$ is monotone in the $i$th direction (recall
         Exercise 2.5). Further, show that if $f$ is monotone in the $j$th
         direction for some $j \in [n]$ then $f^{\sigma_i}$ is still monotone in
         the $j$th direction.
     (e) Let $f^* = f^{\sigma_1\sigma_2\cdots\sigma_n}$. Show that $f^*$ is monotone,
         $\mathbf{E}[f^*] = \mathbf{E}[f]$, $\mathbf{Inf}_j[f^*] \le \mathbf{Inf}_j[f]$ for all
         $j \in [n]$, and $\mathbf{Stab}_\rho[f^*] \ge \mathbf{Stab}_\rho[f]$ for all
         $0 \le \rho \le 1$.

2.53 The Hamming distance $\Delta(x, y) = \#\{i : x_i \ne y_i\}$ on the discrete
     cube $\{-1,1\}^n$ is an example of an $\ell_1$ metric space. For $D \ge 1$, we
     say that the discrete cube can be embedded into $\ell_2$ with distortion
     $D$ if there is a mapping $F : \{-1,1\}^n \to \mathbb{R}^m$ for some
     $m \in \mathbb{N}$ such that:

     $$\|F(x) - F(y)\|_2 \ge \Delta(x, y) \text{ for all } x, y; \quad \text{(“no contraction”)}$$
     $$\|F(x) - F(y)\|_2 \le D \cdot \Delta(x, y) \text{ for all } x, y. \quad \text{(“expansion at most } D\text{”)}$$

     In this exercise you will show that the least distortion possible is
     $D = \sqrt{n}$.
     (a) Recalling the definition of $f^{\mathrm{odd}}$ from Exercise 1.8, show that
         for any $f : \{-1,1\}^n \to \mathbb{R}$ we have
         $\|f^{\mathrm{odd}}\|_2^2 \le \mathbf{I}[f]$ and hence

         $$\mathbf{E}_x[(f(x) - f(-x))^2] \le \sum_{i=1}^n \mathbf{E}_x\big[(f(x) - f(x^{\oplus i}))^2\big].$$

     (b) Suppose $F : \{-1,1\}^n \to \mathbb{R}^m$, and write
         $F(x) = (f_1(x), f_2(x), \ldots, f_m(x))$ for functions
         $f_i : \{-1,1\}^n \to \mathbb{R}$. By summing the above inequality over
         $i \in [m]$, show that any $F$ with no contraction must have expansion
         at least $\sqrt{n}$.
     (c) Show that there is an embedding $F$ achieving distortion $\sqrt{n}$.
2.54 Give a Fourier-free proof of the Poincaré Inequality by induction on $n$.

2.55 Let $V$ be a vector space with norm $\|\cdot\|$ and fix
     $w_1, \ldots, w_n \in V$. Define $g : \{-1,1\}^n \to \mathbb{R}$ by
     $g(x) = \|\sum_{i=1}^n x_i w_i\|$.
     (a) Show that $Lg \le g$ pointwise. (Hint: Triangle inequality.)
     (b) Deduce $2\mathbf{Var}[g] \le \mathbf{E}[g^2]$ and thus the following
         Khintchine-Kahane Inequality:

         $$\mathbf{E}_x\Big[\Big\|\sum_{i=1}^n x_i w_i\Big\|\Big] \ge \frac{1}{\sqrt{2}} \cdot \mathbf{E}_x\Big[\Big\|\sum_{i=1}^n x_i w_i\Big\|^2\Big]^{1/2}.$$

         (Hint: Exercise 2.28.)
     (c) Show that the constant $\frac{1}{\sqrt{2}}$ above is optimal, even if
         $V = \mathbb{R}$.

2.56 In the correlation distillation problem, a source chooses
     $x \sim \{-1,1\}^n$ uniformly at random and broadcasts it to $q$ parties. We
     assume that the transmissions suffer from some kind of noise, and
     therefore the players receive imperfect copies $y^{(1)}, \ldots, y^{(q)}$ of
     $x$. The parties are not allowed to communicate, and despite having
     imperfectly correlated information they wish to agree on a single random
     bit. In other words, the $i$th party will output a bit
     $f_i(y^{(i)}) \in \{-1,1\}$, and the goal is to find functions
     $f_1, \ldots, f_q$ that maximize the probability that
     $f_1(y^{(1)}) = f_2(y^{(2)}) = \cdots = f_q(y^{(q)})$. To avoid trivial
     deterministic solutions, we insist that $\mathbf{E}[f_j(y^{(j)})]$ be 0 for all
     $j \in [q]$.
     (a) Suppose $q = 2$, $\rho \in (0,1)$, and $y^{(j)} \sim N_\rho(x)$
         independently for each $j$. Show that the optimal solution is
         $f_1 = f_2 = \pm\chi_i$ for some $i \in [n]$. (Hint: You’ll need
         Cauchy-Schwarz.)
     (b) Show the same result for $q = 3$.
     (c) Let $q = 2$ and $\rho \in (\frac12, 1)$. Suppose that $y^{(1)} = x$
         exactly, but $y^{(2)} \in \{-1, 0, 1\}^n$ has erasures: it’s formed from
         $x$ by setting $y^{(2)}_i = x_i$ with probability $\rho$ and
         $y^{(2)}_i = 0$ with probability $1 - \rho$, independently for all
         $i \in [n]$. Show that the optimal success probability is
         $\frac12 + \frac12\rho$ and there is an optimal solution in which
         $f_1 = \pm\chi_i$ for any $i \in [n]$. (Hint: Eliminate the source, and
         introduce a fictitious party $1'$. . .)
     (d) Consider the previous scenario but with $\rho \in (0, \frac12)$. Show
         that if $n$ is sufficiently large, then the optimal solution does not
         have $f_1 = \pm\chi_i$.

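(a) Let $g : \{-1, 1\}^n \to \mathbb{R}^{\ge 0}$ have $\mathbf{E}[g] = \delta$. Show that for any $\rho \in [0, 1]$,

\[ \rho \sum_{j=1}^n |\hat{g}(j)| \le \delta + \sum_{k=2}^n \rho^k \|g^{=k}\|_\infty. \]

(Hint: Exercise 2.31.)

(b) Assume further that $g : \{-1, 1\}^n \to \{0, 1\}$. Show that $\|g^{=k}\|_\infty \le \sqrt{\delta}\sqrt{\binom{n}{k}}$. (Hint: First bound $\|g^{=k}\|_2^2$.) Deduce $\rho \sum_{j=1}^n |\hat{g}(j)| \le \delta + 2\rho^2\sqrt{\delta}\,n$, assuming $\rho \le \frac{1}{2\sqrt{n}}$.

(c) Show that $\sum_{j=1}^n |\hat{g}(j)| \le 2\sqrt{2}\,\delta^{3/4}\sqrt{n}$ (assuming $\delta \le 1/4$). Deduce $\mathbf{W}^1[g] \le 2\sqrt{2}\,\delta^{7/4}\sqrt{n}$. (Hint: show $|\hat{g}(j)| \le \delta$ for all $j$.)

(d) Suppose $f : \{-1, 1\}^n \to \{-1, 1\}$ is monotone and $\mathbf{MaxInf}[f] \le \delta$. Show $\mathbf{W}^2[f] \le \sqrt{2} \cdot \delta^{3/4} \cdot \mathbf{I}[f] \cdot \sqrt{n}$.

(e) Suppose further that $f$ is unbiased. Show that $\mathbf{MaxInf}[f] \le o(n^{-2/3})$ implies $\mathbf{I}[f] \ge 3 - o(1)$; conclude $\mathbf{MaxInf}[f] \ge \frac{3}{n} - o(1/n)$. (Hint: Extend Exercise 2.29.) Use Exercise 2.52 to remove the assumption that $f$ is monotone for these statements.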


Let $V$ be a vector space (over $\mathbb{R}$) with norm $\|\cdot\|_V$. If $f : \{-1, 1\}^n \to V$ we can define its Fourier coefficients $\hat{f}(S) \in V$ by the usual formula $\hat{f}(S) = \mathbf{E}_{x \sim \{-1,1\}^n}[f(x) x^S]$. We may also define $\|f\|_p = \mathbf{E}_{x \sim \{-1,1\}^n}[\|f(x)\|_V^p]^{1/p}$. Finally, if the norm $\|\cdot\|_V$ arises from an inner product $\langle \cdot, \cdot \rangle_V$ on $V$ we can define an inner product on functions $f, g : \{-1, 1\}^n \to V$ by $\langle f, g \rangle = \mathbf{E}_{x \sim \{-1,1\}^n}[\langle f(x), g(x) \rangle_V]$. The material developed so far in this book has used $V = \mathbb{R}$ with $\langle \cdot, \cdot \rangle_V$ being multiplication. Explore the extent to which this material extends to the more general setting.


Notes 


The mathematical study of social choice began in earnest in the late 1940s; see 
Riker (Riker, 1961) for an early survey or the compilation (Brams et al., 2009) for 
some modern results. Arrow’s Theorem was the field’s first major result; Arrow proved 
it in 1950 (Arrow, 1950) under the extra assumption of monotonicity (and with a minor 
error (Blau, 1957)), with the refined version appearing in 1963 (Arrow, 1963). He 
was awarded the Nobel Prize for this work in 1972. May’s Theorem is from 1952 
(May, 1952). Guilbaud’s Formula is also from 1952 (Guilbaud, 1952), though Guilbaud
only stated it in a footnote and wrote that it is computed “by the usual means
in combinatorial analysis”. The first published proof appears to be due to Garman 
and Kamien (Garman and Kamien, 1968); they also introduced the impartial culture 
assumption. The term “junta” appears to have been introduced by Parnas, Ron, and 
Samorodnitsky (Parnas et al., 2001). 

The notion of influence $\mathbf{Inf}_i[f]$ was originally introduced by the geneticist Penrose
(Penrose, 1946), who observed that $\mathbf{Inf}_i[\mathrm{Maj}_n] \sim \frac{\sqrt{2/\pi}}{\sqrt{n}}$. It was rediscovered by
the lawyer Banzhaf in 1965 (Banzhaf, 1965); he sued the Nassau County (NY) Board 
after proving that the voting system it used (the one in Exercise 2.9) gave some towns 
zero influence. Influence is sometimes referred to as the Banzhaf, Penrose—Banzhaf, or 
Banzhaf—Coleman index (Coleman being another rediscoverer (Coleman, 1971)). Influ- 
ences were first studied in the computer science literature by Ben-Or and Linial (Ben-Or 
and Linial, 1985); they also introduced “tribes” as an example of a function
with constant variance yet small influences. The Fourier formulas for influence may have 
first appeared in the work of Chor and Geréb-Graus (Chor and Geréb-Graus, 1987). 

Total influence of Boolean functions has long been studied in combinatorics, since it 
is equivalent to edge-boundary size for subsets of the Hamming cube. For example, the 
edge-isoperimetric inequality was first proved by Harper in 1964 (Harper, 1964). In the 
context of Boolean functions, Karpovsky (Karpovsky, 1976) proposed I[ f] as a measure 
of the computational complexity of f, and Hurst, Miller, and Muzio (Hurst et al., 1982) 
gave the Fourier formula $\sum_S |S| \hat{f}(S)^2$. The terminology “Poincaré Inequality” comes
from the theory of functional inequalities and Markov chains; the inequality is equivalent 
to the spectral gap for the discrete cube graph. 

The noise stability of Boolean functions was first studied explicitly by Benjamini, 
Kalai, and Schramm in 1999 (Benjamini et al., 1999), though it plays an important role 
in the earlier work of Håstad (Håstad, 1997). See O’Donnell (O’Donnell, 2003) for a
survey. The noise operator was introduced by Bonami (Bonami, 1970) and independently
by Beckner (Beckner, 1975), who used the notation $\mathrm{T}_\rho$, which was standardized by Kahn,
Kalai, and Linial (Kahn et al., 1988). For nonnegative noise rates it’s often natural to
use the alternate parameterization $\mathrm{T}_{e^{-t}}$ for $t \in [0, \infty]$.

The Fourier approach to Arrow’s Theorem is due to Kalai (Kalai, 2002); he also 
proved Theorem 2.57 and Corollary 2.60. The FKN Theorem is due to Friedgut, Kalai, 
and Naor (Friedgut et al., 2002); the observation from Exercise 2.49 is due to Kindler. 

The polarizations from Exercise 2.52 originate in Kleitman (Kleitman, 1966). Exer- 
cise 2.53 is a theorem of Enflo from 1970 (Enflo, 1970). Exercise 2.55 is a theorem of 
Latała and Oleszkiewicz (Latała and Oleszkiewicz, 1994). In Exercise 2.56, part (b) is
due to Mossel and O’Donnell (Mossel and O’Donnell, 2005); part (c) was conjectured
by Yang (Yang, 2004) and proved by O’Donnell and Wright (O’Donnell and Wright, 
2012). Exercise 2.57 is a polishing of the 1987 work by Chor and Geréb-Graus (Chor and 
Geréb-Graus, 1987, 1988), a precursor of the KKL Theorem. The weaker Exercise 2.29 
is also due to them and Noga Alon independently. 
Spectral Structure and Learning


One reasonable way to assess the “complexity” of a Boolean function is in terms
of how complex its Fourier spectrum is. For example, functions with sufficiently
simple Fourier spectra can be efficiently learned from examples. This chapter 
will be concerned with understanding the location, magnitude, and structure of 
a Boolean function’s Fourier spectrum. 


3.1. Low-Degree Spectral Concentration 


One way a Boolean function’s Fourier spectrum can be “simple” is for it to be 
mostly concentrated at small degree. 


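Definition 3.1. We say that the Fourier spectrum of $f : \{-1, 1\}^n \to \mathbb{R}$ is $\epsilon$-concentrated on degree up to $k$ if

\[ \mathbf{W}^{>k}[f] = \sum_{\substack{S \subseteq [n] \\ |S| > k}} \hat{f}(S)^2 \le \epsilon. \]

For $f : \{-1, 1\}^n \to \{-1, 1\}$ we can express this condition using the spectral sample: $\mathbf{Pr}_{S \sim \mathscr{S}_f}[|S| > k] \le \epsilon$.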


It’s possible to show such a concentration result combinatorially by showing 
that a function has small total influence: 


Proposition 3.2. For any $f : \{-1, 1\}^n \to \mathbb{R}$ and $\epsilon > 0$, the Fourier spectrum of $f$ is $\epsilon$-concentrated on degree up to $\mathbf{I}[f]/\epsilon$.


Proof. This follows immediately from Theorem 2.38, $\mathbf{I}[f] = \sum_{k \ge 0} k \cdot \mathbf{W}^k[f]$. For $f : \{-1, 1\}^n \to \{-1, 1\}$, this is Markov’s inequality applied to the cardinality of the spectral sample.
For example, in Exercise 2.13 you showed that $\mathbf{I}[\mathrm{Tribes}_{w,2^w}] \le O(\log n)$, where $n = w2^w$; thus this function’s spectrum is $.01$-concentrated on degree up to $O(\log n)$, a rather low level. Proving this by explicitly calculating Fourier coefficients would be quite painful.
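Proposition 3.2 can be checked by brute force on a small function. The following is a minimal sketch, assuming $\mathrm{Maj}_5$ as the test function (it is not the Tribes example from the text); the helper names `fourier_weights` and `weight_above` are hypothetical.

```python
from itertools import combinations, product

def fourier_weights(f, n):
    """Return [W^0[f], ..., W^n[f]] for f : {-1,1}^n -> R, by brute force."""
    points = list(product([-1, 1], repeat=n))
    weights = [0.0] * (n + 1)
    for k in range(n + 1):
        for S in combinations(range(n), k):
            total = 0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            weights[k] += (total / len(points)) ** 2
    return weights

def weight_above(weights, threshold):
    """Fourier weight on degrees strictly greater than `threshold`."""
    return sum(w for k, w in enumerate(weights) if k > threshold)

def maj(x):
    return 1 if sum(x) > 0 else -1

n = 5
weights = fourier_weights(maj, n)
# Theorem 2.38: I[f] = sum_k k * W^k[f].
total_influence = sum(k * w for k, w in enumerate(weights))

for eps in (0.9, 0.5, 0.25):
    tail = weight_above(weights, total_influence / eps)
    print(f"eps={eps}: weight above degree {total_influence / eps:.2f} is {tail:.4f}")
```

In each case the printed tail weight should be at most `eps`, as the proposition predicts.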

Another means of showing low-degree spectral concentration is through 
noise stability/sensitivity: 

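Proposition 3.3. For any $f : \{-1, 1\}^n \to \{-1, 1\}$ and $\delta \in (0, 1/2]$, the Fourier spectrum of $f$ is $\epsilon$-concentrated on degree up to $1/\delta$ for

\[ \epsilon = \tfrac{2}{1 - e^{-2}} \mathbf{NS}_\delta[f] \le 3\mathbf{NS}_\delta[f]. \]


Proof. Using the Fourier formula from Theorem 2.49,

\begin{align*}
2\mathbf{NS}_\delta[f] &= \mathbf{E}_{S \sim \mathscr{S}_f}[1 - (1 - 2\delta)^{|S|}] \\
&\ge (1 - (1 - 2\delta)^{1/\delta}) \cdot \mathbf{Pr}_{S \sim \mathscr{S}_f}[|S| \ge 1/\delta] \\
&\ge (1 - e^{-2}) \cdot \mathbf{Pr}_{S \sim \mathscr{S}_f}[|S| \ge 1/\delta],
\end{align*}

where the first inequality used that $1 - (1 - 2\delta)^k$ is a nonnegative nondecreasing function of $k$. The claim follows.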
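As a sanity check on Proposition 3.3, the sketch below computes $\mathbf{NS}_\delta[\mathrm{Maj}_5]$ twice: directly from the definition (each bit flipped independently with probability $\delta$) and from the Fourier formula used in the proof. It then compares the weight at degree at least $1/\delta$ against $3\mathbf{NS}_\delta[f]$. The function names and the choice $\delta = 1/4$ are illustrative assumptions.

```python
from itertools import combinations, product

def fourier_weights(f, n):
    """Return [W^0[f], ..., W^n[f]] by brute force."""
    points = list(product([-1, 1], repeat=n))
    weights = [0.0] * (n + 1)
    for k in range(n + 1):
        for S in combinations(range(n), k):
            total = 0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += f(x) * chi
            weights[k] += (total / len(points)) ** 2
    return weights

def noise_sensitivity(f, n, delta):
    """Pr[f(x) != f(y)], where y flips each bit of uniform x with probability delta."""
    total = 0.0
    for x in product([-1, 1], repeat=n):
        for flips in product([0, 1], repeat=n):
            y = tuple(-xi if b else xi for xi, b in zip(x, flips))
            if f(x) != f(y):
                r = sum(flips)
                total += delta ** r * (1 - delta) ** (n - r)
    return total / 2 ** n

def maj(x):
    return 1 if sum(x) > 0 else -1

n, delta = 5, 0.25
weights = fourier_weights(maj, n)
ns_direct = noise_sensitivity(maj, n, delta)
# Fourier formula: 2 NS_delta[f] = sum_k (1 - (1 - 2 delta)^k) W^k[f].
ns_fourier = 0.5 * sum((1 - (1 - 2 * delta) ** k) * w for k, w in enumerate(weights))
# Weight on degrees >= 1/delta, which the proof bounds by 3 NS_delta[f].
high_weight = sum(w for k, w in enumerate(weights) if k >= 1 / delta)
print(ns_direct, ns_fourier, high_weight)
```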


As an example, Theorem 2.45 tells us that for $\delta > 0$ sufficiently small and $n$ sufficiently large (as a function of $\delta$), $\mathbf{NS}_\delta[\mathrm{Maj}_n] \le \sqrt{\delta}$. Hence the Fourier spectrum of $\mathrm{Maj}_n$ is $3\sqrt{\delta}$-concentrated on degree up to $1/\delta$; equivalently, it is $\epsilon$-concentrated on degree up to $9/\epsilon^2$. (We will give sharp constants for majority’s spectral concentration in Chapter 5.3.) This example also shows there is no simple converse to Proposition 3.2; although $\mathrm{Maj}_n$ has its spectrum $.01$-concentrated on degree up to $O(1)$, its total influence is $\Theta(\sqrt{n})$.

Finally, suppose a function $f : \{-1, 1\}^n \to \{-1, 1\}$ has its Fourier spectrum $0$-concentrated up to degree $k$; in other words, $f$ has real degree $\deg(f) \le k$. In this case $f$ must be somewhat simple; indeed, if $k$ is a constant, then $f$ is a junta:


Theorem 3.4. Suppose $f : \{-1, 1\}^n \to \{-1, 1\}$ has $\deg(f) \le k$. Then $f$ is a $k2^{k-1}$-junta.


The bound $k2^{k-1}$ cannot be significantly improved; see Exercise 3.24. The key to proving Theorem 3.4 is the following lemma, the proof of which is outlined in Exercise 3.4:


Lemma 3.5. Suppose $\deg(f) \le k$, where $f : \{-1, 1\}^n \to \mathbb{R}$ is not identically $0$. Then $\mathbf{Pr}[f(x) \ne 0] \ge 2^{-k}$.




Since $\deg(\mathrm{D}_i f) \le k - 1$ when $\deg(f) \le k$ (by the “differentiation” formula) and since $\mathbf{Inf}_i[f] = \mathbf{Pr}[\mathrm{D}_i f(x) \ne 0]$ for Boolean-valued $f$, we immediately infer:


Proposition 3.6. If $f : \{-1, 1\}^n \to \{-1, 1\}$ has $\deg(f) \le k$ then $\mathbf{Inf}_i[f]$ is either $0$ or at least $2^{1-k}$ for all $i \in [n]$.


We can now give the proof of Theorem 3.4. From Proposition 3.6 the number of coordinates which have nonzero influence on $f$ is at most $\mathbf{I}[f]/2^{1-k}$, and this in turn is at most $k2^{k-1}$ by the following fact:


Fact 3.7. For $f : \{-1, 1\}^n \to \{-1, 1\}$, $\mathbf{I}[f] \le \deg(f)$.


Fact 3.7 is immediate from the Fourier formula for total influence. 
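To see Proposition 3.6 and Theorem 3.4 in a concrete case, here is a small sketch using a hypothetical "selector" function on four bits (output $x_2$ if $x_1 = 1$, else $x_3$; the fourth bit is irrelevant). The function and helper names are assumptions made for illustration.

```python
from itertools import combinations, product
from math import prod

def influence(f, n, i):
    """Inf_i[f] = Pr_x[f(x) != f(x with coordinate i flipped)]."""
    points = list(product([-1, 1], repeat=n))
    changed = 0
    for x in points:
        y = x[:i] + (-x[i],) + x[i + 1:]
        if f(x) != f(y):
            changed += 1
    return changed / len(points)

def degree(f, n):
    """Largest |S| with a nonzero Fourier coefficient (exact for integer-valued f)."""
    points = list(product([-1, 1], repeat=n))
    deg = 0
    for k in range(n + 1):
        for S in combinations(range(n), k):
            if sum(f(x) * prod(x[i] for i in S) for x in points) != 0:
                deg = k
    return deg

def selector(x):
    # Coordinates are 0-based here: x[0] selects between x[1] and x[2].
    return x[1] if x[0] == 1 else x[2]

n = 4
k = degree(selector, n)
influences = [influence(selector, n, i) for i in range(n)]
relevant = sum(1 for v in influences if v > 0)
print(k, influences, relevant)
```

Here the degree is 2, every influence is either 0 or at least $2^{1-2} = 1/2$, and the number of relevant coordinates (3) is at most $k2^{k-1} = 4$.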

We remark that the FKN Theorem (stated in Chapter 2.5) is a “robust” version of Theorem 3.4 for $k = 1$. In Chapter 9.6 we will see Friedgut’s Junta Theorem, a related robust result showing that if $\mathbf{I}[f] \le k$ then $f$ is $\epsilon$-close to a $2^{O(k/\epsilon)}$-junta.


3.2. Subspaces and Decision Trees 


In this section we treat the domain of a Boolean function as $\mathbb{F}_2^n$, an $n$-dimensional vector space over the field $\mathbb{F}_2$. As mentioned in Chapter 1.2, it can be natural to index the Fourier characters $\chi_S : \mathbb{F}_2^n \to \{-1, 1\}$ not by subsets $S \subseteq [n]$ but by their 0-1 indicator vectors $\gamma \in \mathbb{F}_2^n$; thus

\[ \chi_\gamma(x) = (-1)^{\gamma \cdot x}, \]

with the dot product $\gamma \cdot x$ being carried out in $\mathbb{F}_2$. For example, in this notation we’d write $\chi_0$ for the constantly 1 function and $\chi_{e_i}$ for the $i$th dictator. Fact 1.6 now becomes

\[ \chi_\beta \chi_\gamma = \chi_{\beta + \gamma} \quad \forall \beta, \gamma. \tag{3.1} \]

Thus the characters form a group under multiplication, which is isomorphic to the group $\mathbb{F}_2^n$ under addition. To distinguish this group from the input domain we write it as $\widehat{\mathbb{F}_2^n}$; we also tend to identify the character with its index. Thus the Fourier expansion of $f : \mathbb{F}_2^n \to \mathbb{R}$ can be written as

\[ f(x) = \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \hat{f}(\gamma) \chi_\gamma(x). \]

The Fourier transform of $f$ can be thought of as a function $\hat{f} : \widehat{\mathbb{F}_2^n} \to \mathbb{R}$. We can measure its complexity with various norms.
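Definition 3.8. The Fourier (or spectral) $p$-norm of $f : \{-1, 1\}^n \to \mathbb{R}$ is

\[ \hat{\lVert} f \hat{\rVert}_p = \Bigl( \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} |\hat{f}(\gamma)|^p \Bigr)^{1/p}. \]

Note that we use the “counting measure” on $\widehat{\mathbb{F}_2^n}$, and hence we have a nice rephrasing of Parseval’s Theorem: $\|f\|_2 = \hat{\lVert} f \hat{\rVert}_2$. We make two more definitions relating to the simplicity of $\hat{f}$:


Definition 3.9. The Fourier (or spectral) sparsity of $f : \{-1, 1\}^n \to \mathbb{R}$ is

\[ \mathrm{sparsity}(\hat{f}) = |\mathrm{supp}(\hat{f})| = \#\{\gamma \in \widehat{\mathbb{F}_2^n} : \hat{f}(\gamma) \ne 0\}. \]

Definition 3.10. We say that $\hat{f}$ is $\epsilon$-granular if $\hat{f}(\gamma)$ is an integer multiple of $\epsilon$ for all $\gamma \in \widehat{\mathbb{F}_2^n}$.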


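To gain some practice with this notation, let’s look at the Fourier transforms of some indicator functions $1_A : \mathbb{F}_2^n \to \{0, 1\}$ and probability density functions $\varphi_A$, where $A \subseteq \mathbb{F}_2^n$. First, suppose $A \le \mathbb{F}_2^n$ is a subspace. Then one way to characterize $A$ is by its perpendicular subspace $A^\perp$:

\[ A^\perp = \{\gamma \in \widehat{\mathbb{F}_2^n} : \gamma \cdot x = 0 \text{ for all } x \in A\}. \]

It holds that $\dim A^\perp = n - \dim A$ (this is called the codimension of $A$) and that $A = (A^\perp)^\perp$.

Proposition 3.11. If $A \le \mathbb{F}_2^n$ has $\operatorname{codim} A = \dim A^\perp = k$, then

\[ 1_A = \sum_{\gamma \in A^\perp} 2^{-k} \chi_\gamma, \qquad \varphi_A = \sum_{\gamma \in A^\perp} \chi_\gamma. \]


Proof. Let $\gamma_1, \dots, \gamma_k$ form a basis of $A^\perp$. Since $A = (A^\perp)^\perp$ it follows that $x \in A$ if and only if $\chi_{\gamma_i}(x) = 1$ for all $i \in [k]$. We therefore have

\[ 1_A(x) = \prod_{i=1}^k \bigl(\tfrac12 + \tfrac12 \chi_{\gamma_i}(x)\bigr) = 2^{-k} \sum_{\gamma \in \operatorname{span}\{\gamma_1, \dots, \gamma_k\}} \chi_\gamma(x) \]

as claimed, where the last equality used (3.1). The Fourier expansion of $\varphi_A$ follows because $\mathbf{E}[1_A] = 2^{-k}$.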
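Proposition 3.11 is easy to confirm numerically on a small subspace. The sketch below assumes the (hypothetical) example $A = \operatorname{span}\{(1,1,0), (0,0,1)\} \le \mathbb{F}_2^3$, which has codimension $k = 1$; all helper names are illustrative.

```python
from itertools import product

def dot(u, v):
    """Dot product carried out in F_2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def span(basis, n):
    """All F_2-linear combinations of the basis vectors."""
    vectors = set()
    for coeffs in product([0, 1], repeat=len(basis)):
        v = tuple(sum(c * b[i] for c, b in zip(coeffs, basis)) % 2 for i in range(n))
        vectors.add(v)
    return vectors

def perp(A, n):
    """Perpendicular subspace: all gamma with gamma . x = 0 for every x in A."""
    return {g for g in product([0, 1], repeat=n) if all(dot(g, a) == 0 for a in A)}

def indicator_transform(A, n):
    """Fourier transform of 1_A: gamma -> E_x[1_A(x) * (-1)^(gamma . x)]."""
    cube = list(product([0, 1], repeat=n))
    return {g: sum((-1) ** dot(g, x) for x in A) / len(cube) for g in cube}

n = 3
A = span([(1, 1, 0), (0, 0, 1)], n)
A_perp = perp(A, n)
k = n - 2                      # codimension of the 2-dimensional subspace A
transform = indicator_transform(A, n)
for gamma, value in sorted(transform.items()):
    print(gamma, value, "in A-perp" if gamma in A_perp else "")
```

The transform should equal $2^{-k} = 1/2$ exactly on $A^\perp = \{(0,0,0), (1,1,0)\}$ and vanish elsewhere.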


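More generally, suppose $A$ is an affine subspace (or coset) of $\mathbb{F}_2^n$; i.e., $A = H + a$ for some $H \le \mathbb{F}_2^n$ and $a \in \mathbb{F}_2^n$, or equivalently

\[ A = \{x \in \mathbb{F}_2^n : \gamma \cdot x = \gamma \cdot a \text{ for all } \gamma \in H^\perp\}. \]

Then it is easy (Exercise 3.11) to extend Proposition 3.11 to:

Proposition 3.12. If $A = H + a$ is an affine subspace of codimension $k$, then

\[ \widehat{1_A}(\gamma) = \begin{cases} \chi_\gamma(a) 2^{-k} & \text{if } \gamma \in H^\perp, \\ 0 & \text{else;} \end{cases} \]

hence $\varphi_A = \sum_{\gamma \in H^\perp} \chi_\gamma(a) \chi_\gamma$. We have $\mathrm{sparsity}(\widehat{1_A}) = 2^k$, $\widehat{1_A}$ is $2^{-k}$-granular, $\hat{\lVert} 1_A \hat{\rVert}_\infty = 2^{-k}$, and $\hat{\lVert} 1_A \hat{\rVert}_1 = 1$.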


In computer science terminology, any $f : \mathbb{F}_2^n \to \{0, 1\}$ that is a conjunction
of parity conditions is the indicator of an affine subspace (or the zero function). 
In the simple case that the parity conditions are all of the form “x; = a;”, the 
function is a logical AND of literals, and we call the affine subspace a subcube. 

Another class of Boolean functions with simple Fourier spectra are the ones 
computable by simple decision trees: 


Definition 3.13. A decision tree $T$ is a representation of a Boolean function $f : \mathbb{F}_2^n \to \mathbb{R}$. It consists of a rooted binary tree in which the internal nodes are labeled by coordinates $i \in [n]$, the outgoing edges of each internal node are labeled 0 and 1, and the leaves are labeled by real numbers. We insist that no coordinate $i \in [n]$ appears more than once on any root-to-leaf path.

On input $x \in \mathbb{F}_2^n$, the tree $T$ constructs a computation path from the root node to a leaf. Specifically, when the computation path reaches an internal node labeled by coordinate $i \in [n]$ we say that $T$ queries $x_i$; the computation path then follows the outgoing edge labeled by $x_i$. The output of $T$ (and hence $f$) on input $x$ is the label of the leaf reached by the computation path. We often identify a tree with the function it computes.


For decision trees, a picture is worth a thousand words; see Figure 3.1. 

(It’s traditional to write $x_i$ rather than $i$ for the internal node labels.) For example, the computation path of the above tree on input $x = (0, 1, 0) \in \mathbb{F}_2^3$ starts at the root, queries $x_1$, proceeds left, queries $x_3$, proceeds left, queries $x_2$, proceeds right, and reaches a leaf labeled 0. In fact, this tree computes the function $\mathrm{Sort}_3$ defined by $\mathrm{Sort}_3(x) = 1$ if and only if $x_1 \le x_2 \le x_3$ or $x_1 \ge x_2 \ge x_3$.


Figure 3.1. Decision tree computing $\mathrm{Sort}_3$


Definition 3.14. The size $s$ of a decision tree $T$ is the total number of leaves. The depth $k$ of $T$ is the maximum length of any root-to-leaf path. For decision trees over $\mathbb{F}_2^n$ we have $k \le n$ and $s \le 2^k$. Given $f : \mathbb{F}_2^n \to \mathbb{R}$ we write $\mathrm{DT}(f)$ (respectively, $\mathrm{DT}_{\mathrm{size}}(f)$) for the least depth (respectively, size) of a decision tree computing $f$.


The example decision tree in Figure 3.1 has size 6 and depth 3. 
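Since the figure itself is not reproduced here, the following sketch encodes one decision tree for $\mathrm{Sort}_3$ that is consistent with the computation path, size, and depth described in the text. The exact shape of the tree is an assumption, as are the tuple encoding and the helper names.

```python
from itertools import product

# A tree is either a leaf label or a tuple (i, subtree_if_0, subtree_if_1),
# where i is the 0-based index of the queried coordinate.
# Assumed reconstruction: query x1, then x3, then (only if x1 == x3) x2.
SORT3_TREE = (0,
              (2, (1, 1, 0), 1),
              (2, 1, (1, 0, 1)))

def evaluate(tree, x):
    """Follow the computation path of x and return the leaf label."""
    while isinstance(tree, tuple):
        i, if_zero, if_one = tree
        tree = if_one if x[i] else if_zero
    return tree

def queried(tree, x):
    """Coordinates (0-based) queried along the computation path of x."""
    path = []
    while isinstance(tree, tuple):
        i, if_zero, if_one = tree
        path.append(i)
        tree = if_one if x[i] else if_zero
    return path

def size(tree):
    """Number of leaves."""
    return size(tree[1]) + size(tree[2]) if isinstance(tree, tuple) else 1

def depth(tree):
    """Maximum number of queries on any root-to-leaf path."""
    return 1 + max(depth(tree[1]), depth(tree[2])) if isinstance(tree, tuple) else 0

def sort3(x):
    return 1 if x[0] <= x[1] <= x[2] or x[0] >= x[1] >= x[2] else 0

print(queried(SORT3_TREE, (0, 1, 0)), evaluate(SORT3_TREE, (0, 1, 0)))
print(size(SORT3_TREE), depth(SORT3_TREE))
```

On input $(0, 1, 0)$ the path queries $x_1$, $x_3$, $x_2$ in that order and outputs 0, matching the description above.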

Let $T$ be a decision tree computing $f : \mathbb{F}_2^n \to \mathbb{R}$ and let $P$ be one of its root-to-leaf paths. The set of inputs $x$ that follow computation path $P$ in $T$ is precisely a subcube of $\mathbb{F}_2^n$, call it $C_P$. The function $f$ is constant on $C_P$; we will call its value there $f(P)$. Further, since every input $x$ follows a unique path in $T$, the subcubes $\{C_P : P \text{ a path in } T\}$ form a partition of $\mathbb{F}_2^n$. These observations yield the following “spectral simplicity” results for decision trees:


Fact 3.15. Let $f : \mathbb{F}_2^n \to \mathbb{R}$ be computed by a decision tree $T$. Then

\[ f = \sum_{\text{paths } P \text{ of } T} f(P) \cdot 1_{C_P}. \]


Proposition 3.16. Let $f : \mathbb{F}_2^n \to \mathbb{R}$ be computed by a decision tree $T$ of size $s$ and depth $k$. Then:

• $\deg(f) \le k$;
• $\mathrm{sparsity}(\hat{f}) \le s2^k \le 4^k$;
• $\hat{\lVert} f \hat{\rVert}_1 \le \|f\|_\infty \cdot s \le \|f\|_\infty \cdot 2^k$;
• $\hat{f}$ is $2^{-k}$-granular assuming $f : \mathbb{F}_2^n \to \mathbb{Z}$.


Proposition 3.17. Let $f : \mathbb{F}_2^n \to \{-1, 1\}$ be computable by a decision tree of size $s$ and let $\epsilon \in (0, 1]$. Then the spectrum of $f$ is $\epsilon$-concentrated on degree up to $\log(s/\epsilon)$.


You are asked to prove these propositions in Exercises 3.21 and 3.22. Similar spectral simplicity results hold for some generalizations of the decision tree representation (“subcube partitions”, “parity decision trees”); see Exercise 3.26.
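The bounds of Proposition 3.16 can be checked by brute force for $\mathrm{Sort}_3$, viewed as a $\{0,1\}$-valued function on $\mathbb{F}_2^3$ with the tree size $s = 6$ and depth $k = 3$ quoted in the text. This is only a sketch; `fourier_transform` is a hypothetical helper.

```python
from itertools import product

def sort3(x):
    return 1 if x[0] <= x[1] <= x[2] or x[0] >= x[1] >= x[2] else 0

def fourier_transform(f, n):
    """gamma -> E_x[f(x) * (-1)^(gamma . x)] over F_2^n, by brute force."""
    cube = list(product([0, 1], repeat=n))
    return {
        g: sum(f(x) * (-1) ** (sum(a * b for a, b in zip(g, x)) % 2) for x in cube) / len(cube)
        for g in cube
    }

n, s, k = 3, 6, 3            # size and depth of the example tree, taken from the text
fhat = fourier_transform(sort3, n)

support = [g for g, c in fhat.items() if c != 0]
deg = max(sum(g) for g in support)                 # degree = largest Hamming weight in the support
sparsity = len(support)
one_norm = sum(abs(c) for c in fhat.values())
sup_norm = max(abs(sort3(x)) for x in product([0, 1], repeat=n))
granular = all((c * 2 ** k).is_integer() for c in fhat.values())

print(deg, sparsity, one_norm, granular)
```

For this function the degree is 2, the sparsity is 4, and the Fourier 1-norm is 1.5, all comfortably within the stated bounds.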


3.3. Restrictions 


A common operation on Boolean functions $f : \{-1, 1\}^n \to \mathbb{R}$ is restriction to subcubes. Suppose $[n]$ is partitioned into two sets, $J$ and $\overline{J} = [n] \setminus J$. If the input bits in $\overline{J}$ are fixed to constants, the result is a function $\{-1, 1\}^J \to \mathbb{R}$. For example, if we take the function $\mathrm{Maj}_5 : \{-1, 1\}^5 \to \{-1, 1\}$ and restrict the 4th and 5th coordinates to be $1$ and $-1$ respectively, we obtain the function $\mathrm{Maj}_3 : \{-1, 1\}^3 \to \{-1, 1\}$. If we further restrict the 3rd coordinate to be $-1$, we obtain the two-bit function which is $1$ if and only if both input bits are $1$. We introduce the following notation:


Definition 3.18. Let $f : \{-1, 1\}^n \to \mathbb{R}$ and let $(J, \overline{J})$ be a partition of $[n]$. Let $z \in \{-1, 1\}^{\overline{J}}$. Then we write $f_{J|z} : \{-1, 1\}^J \to \mathbb{R}$ (pronounced “the restriction of $f$ to $J$ using $z$”) for the subfunction of $f$ given by fixing the coordinates in $\overline{J}$ to the bit values $z$. When the partition $(J, \overline{J})$ is understood we may write simply $f_{|z}$. If $y \in \{-1, 1\}^J$ and $z \in \{-1, 1\}^{\overline{J}}$ we will sometimes write $(y, z)$ for the composite string in $\{-1, 1\}^n$, even though $y$ and $z$ are not literally concatenated; with this notation, $f_{J|z}(y) = f(y, z)$.


Let’s examine how restrictions affect the Fourier transform by considering 
an example. 


Example 3.19. Let $f : \{-1, 1\}^4 \to \{-1, 1\}$ be the function defined by

\[ f(x) = 1 \iff x_3 = x_4 = -1 \ \text{ or } \ x_1 \ge x_2 \ge x_3 \ge x_4 \ \text{ or } \ x_1 \le x_2 \le x_3 \le x_4. \tag{3.2} \]

You can check that $f$ has the Fourier expansion

\begin{align*}
f(x) = {} & +\tfrac18 - \tfrac18 x_1 + \tfrac18 x_2 - \tfrac18 x_3 - \tfrac18 x_4 \\
& + \tfrac38 x_1x_2 + \tfrac18 x_1x_3 - \tfrac38 x_1x_4 + \tfrac38 x_2x_3 - \tfrac18 x_2x_4 + \tfrac58 x_3x_4 \tag{3.3} \\
& + \tfrac18 x_1x_2x_3 + \tfrac18 x_1x_2x_4 - \tfrac18 x_1x_3x_4 + \tfrac18 x_2x_3x_4 - \tfrac18 x_1x_2x_3x_4.
\end{align*}

Consider the restriction $x_3 = 1$, $x_4 = -1$, and let $f' = f_{\{1,2\}|(1,-1)}$ be the restricted function of $x_1$ and $x_2$. From the original definition (3.2) of $f$ we see that $f'(x_1, x_2)$ is $1$ if and only if $x_1 = x_2 = 1$. This is the $\min_2$ function of $x_1$ and $x_2$, which we know has Fourier expansion

\[ f'(x_1, x_2) = {\min}_2(x_1, x_2) = -\tfrac12 + \tfrac12 x_1 + \tfrac12 x_2 + \tfrac12 x_1x_2. \tag{3.4} \]

We can of course obtain this expansion simply by plugging $x_3 = 1$, $x_4 = -1$ into (3.3). Now suppose we only wanted to know the coefficient on $x_1$ in the Fourier expansion of $f'$. We can find it as follows: Consider all monomials in (3.3) that contain $x_1$ and possibly also $x_3$, $x_4$; substitute $x_3 = 1$, $x_4 = -1$ into the associated terms; and sum the results. The relevant terms in (3.3) are $-\tfrac18 x_1$, $+\tfrac18 x_1x_3$, $-\tfrac38 x_1x_4$, $-\tfrac18 x_1x_3x_4$, and substituting in $x_3 = 1$, $x_4 = -1$ gives us $-\tfrac18 + \tfrac18 + \tfrac38 + \tfrac18 = \tfrac12$, as expected from (3.4).
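The expansions in Example 3.19 can be confirmed by direct enumeration. The sketch below recomputes all sixteen coefficients of $f$ and the four coefficients of the restriction $f'$; subsets are written as tuples of 1-based coordinates, and the helper name `fourier_coefficients` is an assumption.

```python
from itertools import combinations, product

def f(x):
    x1, x2, x3, x4 = x
    if (x3 == -1 and x4 == -1) or x1 >= x2 >= x3 >= x4 or x1 <= x2 <= x3 <= x4:
        return 1
    return -1

def fourier_coefficients(g, n):
    """Map each subset S (tuple of 1-based coordinates) to E_x[g(x) * x^S]."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in combinations(range(1, n + 1), r):
            total = 0
            for x in points:
                chi = 1
                for i in S:
                    chi *= x[i - 1]
                total += g(x) * chi
            coeffs[S] = total / len(points)
    return coeffs

coeffs = fourier_coefficients(f, 4)

# Restriction x3 = 1, x4 = -1: a function of (x1, x2).
def f_restricted(y):
    return f((y[0], y[1], 1, -1))

restricted = fourier_coefficients(f_restricted, 2)

# Coefficient on x1 of the restriction, read off from the expansion of f:
# sum over T subset of {3, 4} of f_hat({1} union T) * z^T with z = (1, -1).
z = {3: 1, 4: -1}
from_formula = (coeffs[(1,)] + coeffs[(1, 3)] * z[3]
                + coeffs[(1, 4)] * z[4] + coeffs[(1, 3, 4)] * z[3] * z[4])
print(coeffs)
print(restricted, from_formula)
```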
[Figure 3.2: an input split as $(y, z)$ according to the coordinate partition $(J, \overline{J})$, with $S \subseteq J$ and $T \subseteq \overline{J}$.]

Figure 3.2. Notation for a typical restriction scenario. Note that $J$ and $\overline{J}$ need not be literally contiguous.


Now we work out these ideas more generally. In the setting of Definition 3.18 the restricted function $f_{J|z}$ has $\{-1, 1\}^J$ as its domain. Thus its Fourier coefficients are indexed by subsets of $J$. Let’s introduce notation for the Fourier coefficients of a restricted function:


Definition 3.20. Let $f : \{-1, 1\}^n \to \mathbb{R}$ and let $(J, \overline{J})$ be a partition of $[n]$. Let $S \subseteq J$. Then we write $\mathrm{F}_{S|\overline{J}} f : \{-1, 1\}^{\overline{J}} \to \mathbb{R}$ for the function $\widehat{f_{J|\bullet}}(S)$; i.e.,

\[ \mathrm{F}_{S|\overline{J}} f(z) = \widehat{f_{J|z}}(S). \]

When the partition $(J, \overline{J})$ is understood we may write simply $\mathrm{F}_{S|} f$.


In Example 3.19 we considered $\overline{J} = \{3, 4\}$, $S = \{1\}$, and $z = (1, -1)$. See Figure 3.2 for an illustration of a typical restriction scenario.

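In general, for a fixed partition $(J, \overline{J})$ of $[n]$ and a fixed $S \subseteq J$, we may wish to know what $\widehat{f_{J|z}}(S)$ is as a function of $z \in \{-1, 1\}^{\overline{J}}$. This is precisely asking for the Fourier transform of $\mathrm{F}_{S|\overline{J}} f$. Since the function $\mathrm{F}_{S|\overline{J}} f$ has domain $\{-1, 1\}^{\overline{J}}$, its Fourier transform has coefficients indexed by subsets of $\overline{J}$. The formula for this Fourier transform generalizes the computation we used at the end of Example 3.19:


Proposition 3.21. In the setting of Definition 3.20 we have the Fourier expansion

\[ \mathrm{F}_{S|\overline{J}} f(z) = \sum_{T \subseteq \overline{J}} \hat{f}(S \cup T) z^T; \]

i.e.,

\[ \widehat{\mathrm{F}_{S|\overline{J}} f}(T) = \hat{f}(S \cup T). \]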


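Proof. (The $S = \emptyset$ case here is Exercise 1.15.) Every $U \subseteq [n]$ indexing $f$’s Fourier coefficients can be written as a disjoint union $U = S \cup T$, where $S \subseteq J$ and $T \subseteq \overline{J}$. We can also decompose any $x \in \{-1, 1\}^n$ into two substrings $y \in \{-1, 1\}^J$ and $z \in \{-1, 1\}^{\overline{J}}$. We have $x^U = y^S z^T$ and so

\[ f(x) = \sum_{U \subseteq [n]} \hat{f}(U) x^U = \sum_{S \subseteq J} \sum_{T \subseteq \overline{J}} \hat{f}(S \cup T) y^S z^T = \sum_{S \subseteq J} \Bigl( \sum_{T \subseteq \overline{J}} \hat{f}(S \cup T) z^T \Bigr) y^S. \]

Thus when $z$ is fixed, the resulting function of $y$ indeed has $\sum_{T \subseteq \overline{J}} \hat{f}(S \cup T) z^T$ as its Fourier coefficient on the monomial $y^S$.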


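Corollary 3.22. Let $f : \{-1, 1\}^n \to \mathbb{R}$, let $(J, \overline{J})$ be a partition of $[n]$, and fix $S \subseteq J$. Suppose $z \sim \{-1, 1\}^{\overline{J}}$ is chosen uniformly at random. Then

\[ \mathbf{E}_z[\widehat{f_{J|z}}(S)] = \hat{f}(S), \qquad \mathbf{E}_z[\widehat{f_{J|z}}(S)^2] = \sum_{T \subseteq \overline{J}} \hat{f}(S \cup T)^2. \]


Proof. The first statement is immediate from Proposition 3.21, taking $T = \emptyset$ and unraveling the definition. As for the second statement,

\begin{align*}
\mathbf{E}_z[\widehat{f_{J|z}}(S)^2] &= \mathbf{E}_z[\mathrm{F}_{S|\overline{J}} f(z)^2] && \text{(by definition)} \\
&= \sum_{T \subseteq \overline{J}} \widehat{\mathrm{F}_{S|\overline{J}} f}(T)^2 && \text{(Parseval)} \\
&= \sum_{T \subseteq \overline{J}} \hat{f}(S \cup T)^2 && \text{(Proposition 3.21).}
\end{align*}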
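Corollary 3.22 can be tested on the function from Example 3.19 with $J = \{1, 2\}$, $\overline{J} = \{3, 4\}$, and $S = \{1\}$. The sketch below averages the restricted coefficient and its square over all four settings of $z$; the helper names are assumptions.

```python
from itertools import product

def f(x):
    x1, x2, x3, x4 = x
    if (x3 == -1 and x4 == -1) or x1 >= x2 >= x3 >= x4 or x1 <= x2 <= x3 <= x4:
        return 1
    return -1

def coefficient(g, n, S):
    """E_x[g(x) * x^S] for S a tuple of 1-based coordinates."""
    total = 0
    for x in product([-1, 1], repeat=n):
        chi = 1
        for i in S:
            chi *= x[i - 1]
        total += g(x) * chi
    return total / 2 ** n

def restricted_coefficient(g, n, J, S, z):
    """Coefficient on S of the restriction of g to J, with coordinates outside J fixed by z."""
    total = 0
    for y in product([-1, 1], repeat=len(J)):
        assignment = dict(zip(J, y))
        assignment.update(z)
        x = tuple(assignment[i] for i in range(1, n + 1))
        chi = 1
        for i in S:
            chi *= assignment[i]
        total += g(x) * chi
    return total / 2 ** len(J)

n, J, J_bar, S = 4, (1, 2), (3, 4), (1,)
values = [restricted_coefficient(f, n, J, S, dict(zip(J_bar, z)))
          for z in product([-1, 1], repeat=len(J_bar))]

mean = sum(values) / len(values)
second_moment = sum(v * v for v in values) / len(values)
# Right-hand sides: f_hat(S) and the sum over T subset of J_bar of f_hat(S union T)^2.
rhs_mean = coefficient(f, n, S)
rhs_second = sum(coefficient(f, n, S + T) ** 2 for T in [(), (3,), (4,), (3, 4)])
print(mean, rhs_mean, second_moment, rhs_second)
```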


We move on to discussing a more general kind of restriction; namely, restricting a function $f : \mathbb{F}_2^n \to \mathbb{R}$ to an affine subspace $H + z$. This generalizes restriction to subcubes as we’ve seen so far, by considering $H = \operatorname{span}\{e_i : i \in J\}$ for a given subset $J \subseteq [n]$. For restrictions to a subspace $H \le \mathbb{F}_2^n$ we have a natural definition:


Definition 3.23. If $f : \mathbb{F}_2^n \to \mathbb{R}$ and $H \le \mathbb{F}_2^n$ is a subspace, we write $f_H : H \to \mathbb{R}$ for the restriction of $f$ to $H$.


For restrictions to affine subspaces, we run into difficulties if we try to extend our notation for restrictions to subcubes. Unlike in the subcube case of $H = \operatorname{span}\{e_i : i \in J\}$, we don’t in general have a canonical isomorphism between $H$ and a coset $H + z$. Thus it’s not natural to introduce notation such as $f_{H|z} : H \to \mathbb{R}$ for the function $h \mapsto f(h + z)$, because such a definition depends on the choice of representative for $H + z$. As an example consider $H = \{(0, 0), (1, 1)\} \le \mathbb{F}_2^2$, a 1-dimensional subspace (which satisfies $H^\perp = H$). Here the nontrivial coset is $H + (1, 0) = H + (0, 1) = \{(1, 0), (0, 1)\}$, which has no canonical representative.

To get around this difficulty we can view restriction to a coset H + z as con- 
sisting of two steps: first, translation of the domain by a fixed representative z, 
and then restriction to the subspace H. Let’s introduce some notation for the 
first operation: 


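Definition 3.24. Let $f : \mathbb{F}_2^n \to \mathbb{R}$ and let $z \in \mathbb{F}_2^n$. We define the function $f^{+z} : \mathbb{F}_2^n \to \mathbb{R}$ by $f^{+z}(x) = f(x + z)$.


By substituting $x + z$ for $x$ in the Fourier expansion of $f$, we deduce:


Fact 3.25. The Fourier coefficients of $f^{+z}$ are given by $\widehat{f^{+z}}(\gamma) = (-1)^{\gamma \cdot z} \hat{f}(\gamma)$; i.e.,

\[ f^{+z}(x) = \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \chi_\gamma(z) \hat{f}(\gamma) \chi_\gamma(x). \]

(This fact also follows by noting that $f^{+z} = \varphi_{\{z\}} * f$; see Exercise 3.31.)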
We can now give notation for the restriction of a function to an affine 
subspace: 


Definition 3.26. Let $f : \mathbb{F}_2^n \to \mathbb{R}$, $z \in \mathbb{F}_2^n$, $H \le \mathbb{F}_2^n$. We write $f_H^{+z} : H \to \mathbb{R}$ for the function $(f^{+z})_H$; namely, the restriction of $f$ to coset $H + z$ with the representative $z$ made explicit.


Finally, we would like to consider Fourier coefficients of restricted functions $f_H^{+z}$. These can be indexed by the cosets of $H^\perp$ in $\widehat{\mathbb{F}_2^n}$. However, we again have a notational difficulty since the only coset with a canonical representative is $H^\perp$ itself, with representative $0$. There is no need to introduce extra notation for $\widehat{f_H^{+z}}(0)$, the average value of $f$ on coset $H + z$, since it is just

\[ \mathbf{E}_{h \sim H}[f(h + z)] = \langle \varphi_H, f^{+z} \rangle. \]

Applying Plancherel on the right-hand side, as well as Proposition 3.11 and Fact 3.25, we deduce the following classical fact:


Poisson Summation Formula. Let $f : \mathbb{F}_2^n \to \mathbb{R}$, $H \le \mathbb{F}_2^n$, $z \in \mathbb{F}_2^n$. Then

\[ \mathbf{E}_{h \sim H}[f(h + z)] = \sum_{\gamma \in H^\perp} \chi_\gamma(z) \hat{f}(\gamma). \]
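A quick numerical check of the Poisson Summation Formula, as a sketch: the test function `f`, the subspace `H`, and the representative `z` below are arbitrary illustrative choices on $\mathbb{F}_2^3$, not taken from the text.

```python
from itertools import product

def dot(u, v):
    """Dot product carried out in F_2."""
    return sum(a * b for a, b in zip(u, v)) % 2

def add(u, v):
    """Coordinatewise addition in F_2^n."""
    return tuple((a + b) % 2 for a, b in zip(u, v))

def f(x):
    # An arbitrary real-valued function on F_2^3 (bits treated as the integers 0/1).
    return 1 + 2 * x[0] + 3 * x[0] * x[1] + 5 * x[1] * x[2]

n = 3
cube = list(product([0, 1], repeat=n))
H = [(0, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]           # a 2-dimensional subspace
H_perp = [g for g in cube if all(dot(g, h) == 0 for h in H)]
z = (1, 0, 0)

fhat = {g: sum(f(x) * (-1) ** dot(g, x) for x in cube) / len(cube) for g in cube}

lhs = sum(f(add(h, z)) for h in H) / len(H)                 # average of f over the coset H + z
rhs = sum((-1) ** dot(g, z) * fhat[g] for g in H_perp)      # sum over H-perp of chi_gamma(z) f_hat(gamma)
print(lhs, rhs)
```

Both sides evaluate to 3.25 for this choice of `f`, `H`, and `z`.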
3.4. Learning Theory


Computational learning theory is an area of algorithms research devoted to 
the following task: Given a source of “examples” (x, f(x)) from an unknown 
function f, compute a “hypothesis” function h that is good at predicting f(y)
on future inputs y. In this book we will focus on just one possible formulation 
of the task: 


Definition 3.27. In the model of PAC (“Probably Approximately Correct”) learning under the uniform distribution on $\{-1, 1\}^n$, a learning problem is identified with a concept class $\mathcal{C}$, which is just a collection of functions $f : \{-1, 1\}^n \to \{-1, 1\}$. A learning algorithm $A$ for $\mathcal{C}$ is a randomized algorithm which has limited access to an unknown target function $f \in \mathcal{C}$. The two access models, in increasing order of strength, are:

• random examples, meaning $A$ can draw pairs $(x, f(x))$ where $x \in \{-1, 1\}^n$ is uniformly random;
• queries, meaning $A$ can request the value $f(x)$ for any $x \in \{-1, 1\}^n$ of its choice.

In addition, $A$ is given as input an accuracy parameter $\epsilon \in [0, 1/2]$. The output of $A$ is required to be (the circuit representation of) a hypothesis function $h : \{-1, 1\}^n \to \{-1, 1\}$. We say that $A$ learns $\mathcal{C}$ with error $\epsilon$ if for any $f \in \mathcal{C}$, with high probability $A$ outputs an $h$ which is $\epsilon$-close to $f$: i.e., satisfies $\mathrm{dist}(f, h) \le \epsilon$.


In the above definition, the phrase “with high probability” can be fixed 
to mean, say, “except with probability at most 1/10”. (As is common with 
randomized algorithms, the choice of constant 1/10 is unimportant; see Exer- 
cise 3.40.) 

For us, the main desideratum of a learning algorithm is efficient running time. One can easily learn any function $f$ to error $0$ in time $\widetilde{O}(2^n)$ (see Exercise 3.33); however, this is not very efficient. If the concept class $\mathcal{C}$ contains very complex functions, then such exponential running time is necessary; however, if $\mathcal{C}$ contains only relatively “simple” functions, then more efficient learning may be possible. For example, the results of Section 3.5 show that the concept class

\[ \mathcal{C} = \{f : \mathbb{F}_2^n \to \{-1, 1\} \mid \mathrm{DT}_{\mathrm{size}}(f) \le s\} \]

can be learned with queries to error $\epsilon$ by an algorithm running in time $\mathrm{poly}(s, n, 1/\epsilon)$.
A common way of trying to learn an unknown target $f : \{-1, 1\}^n \to \{-1, 1\}$ is by discovering “most of” its Fourier spectrum. To formalize this, let’s generalize Definition 3.1:


Definition 3.28. Let $\mathcal{F}$ be a collection of subsets $S \subseteq [n]$. We say that the Fourier spectrum of $f : \{-1, 1\}^n \to \mathbb{R}$ is $\epsilon$-concentrated on $\mathcal{F}$ if

\[ \sum_{\substack{S \subseteq [n] \\ S \notin \mathcal{F}}} \hat{f}(S)^2 \le \epsilon. \]

For $f : \{-1, 1\}^n \to \{-1, 1\}$ we can express this condition using the spectral sample: $\mathbf{Pr}_{S \sim \mathscr{S}_f}[S \notin \mathcal{F}] \le \epsilon$.


Most functions don’t have their Fourier spectrum concentrated on a small 
collection (see Exercise 3.35). But for those that do, we may hope to discover 
“most of” their Fourier coefficients. The main result of this section is a kind 
of “meta-algorithm” for learning an unknown target f. It reduces the problem 
of learning f to the problem of identifying a collection of characters on which 
f’s Fourier spectrum is concentrated. 


Theorem 3.29. Assume learning algorithm $A$ has (at least) random example access to target $f : \{-1, 1\}^n \to \{-1, 1\}$. Suppose that $A$ can, somehow, identify a collection $\mathcal{F}$ of subsets on which $f$’s Fourier spectrum is $\epsilon/2$-concentrated. Then using $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ additional time, $A$ can with high probability output a hypothesis $h$ that is $\epsilon$-close to $f$.


The idea of the theorem is that A will estimate all of f’s Fourier coefficients 
in F, obtaining a good approximation to f’s Fourier expansion. Then A’s 
hypothesis will be the sign of this approximate Fourier expansion. 

The first tool we need to prove Theorem 3.29 is the ability to accurately 
estimate any fixed Fourier coefficient: 


Proposition 3.30. Given access to random examples from $f : \{-1, 1\}^n \to \{-1, 1\}$, there is a randomized algorithm which takes as input $S \subseteq [n]$, $0 < \delta, \epsilon \le 1/2$, and outputs an estimate $\tilde{f}(S)$ for $\hat{f}(S)$ that satisfies

\[ |\tilde{f}(S) - \hat{f}(S)| \le \epsilon \]

except with probability at most $\delta$. The running time is $\mathrm{poly}(n, 1/\epsilon) \cdot \log(1/\delta)$.


Proof. We have $\hat{f}(S) = \mathbf{E}_x[f(x)\chi_S(x)]$. Given random examples $(x, f(x))$, the algorithm can compute $f(x)\chi_S(x) \in \{-1, 1\}$ and therefore empirically estimate $\mathbf{E}_x[f(x)\chi_S(x)]$. A standard application of the Chernoff bound implies that $O(\log(1/\delta)/\epsilon^2)$ examples are sufficient to obtain an estimate within $\pm\epsilon$ with probability at least $1 - \delta$.
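The estimator in this proof can be sketched in a few lines. The sample size below comes from Hoeffding's inequality for $[-1, 1]$-valued variables ($2e^{-m\epsilon^2/2} \le \delta$), which is one concrete way to instantiate the "standard application of the Chernoff bound"; the function names, the target $\mathrm{Maj}_3$, and the seeded generator are illustrative assumptions.

```python
import math
import random

def estimate_coefficient(draw_example, S, eps, delta):
    """Empirical mean of f(x) * chi_S(x) over m random examples (x, f(x)).

    Assumed sample size: m = ceil(2 ln(2/delta) / eps^2), from Hoeffding's inequality.
    """
    m = math.ceil(2 * math.log(2 / delta) / eps ** 2)
    total = 0
    for _ in range(m):
        x, fx = draw_example()
        chi = 1
        for i in S:
            chi *= x[i]
        total += fx * chi
    return total / m

def make_example_oracle(f, n, rng):
    """Random-example access: each call returns (x, f(x)) for uniform x in {-1,1}^n."""
    def draw_example():
        x = tuple(rng.choice((-1, 1)) for _ in range(n))
        return x, f(x)
    return draw_example

def maj3(x):
    return 1 if sum(x) > 0 else -1

rng = random.Random(0)                      # seeded only to make the demo reproducible
oracle = make_example_oracle(maj3, 3, rng)
estimate = estimate_coefficient(oracle, (0,), eps=0.2, delta=1e-6)
print(estimate)                             # the true coefficient on coordinate 0 is 1/2
```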


The second observation we need to prove Theorem 3.29 is the following: 


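Proposition 3.31. Suppose that $f : \{-1, 1\}^n \to \{-1, 1\}$ and $g : \{-1, 1\}^n \to \mathbb{R}$ satisfy $\|f - g\|_2^2 \le \epsilon$. Let $h : \{-1, 1\}^n \to \{-1, 1\}$ be defined by $h(x) = \mathrm{sgn}(g(x))$, with $\mathrm{sgn}(0)$ chosen arbitrarily from $\{-1, 1\}$. Then $\mathrm{dist}(f, h) \le \epsilon$.


Proof. Since $|f(x) - g(x)|^2 \ge 1$ whenever $f(x) \ne \mathrm{sgn}(g(x))$, we conclude

\[ \mathrm{dist}(f, h) = \mathbf{Pr}_x[f(x) \ne h(x)] = \mathbf{E}_x[1_{f(x) \ne \mathrm{sgn}(g(x))}] \le \mathbf{E}_x[|f(x) - g(x)|^2] = \|f - g\|_2^2. \]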


(See Exercise 3.34 for an improvement to this argument.) 
We can now prove Theorem 3.29: 


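Proof of Theorem 3.29. For each $S \in \mathcal{F}$ the algorithm uses Proposition 3.30 to produce an estimate $\tilde{f}(S)$ for $\hat{f}(S)$ which satisfies $|\tilde{f}(S) - \hat{f}(S)| \le \sqrt{\epsilon}/(2\sqrt{|\mathcal{F}|})$ except with probability at most $1/(10|\mathcal{F}|)$. Overall this requires $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ time, and by the union bound, except with probability at most $1/10$ all $|\mathcal{F}|$ estimates have the desired accuracy. Finally, $A$ forms the real-valued function $g = \sum_{S \in \mathcal{F}} \tilde{f}(S)\chi_S$ and outputs hypothesis $h = \mathrm{sgn}(g)$. By Proposition 3.31, it suffices to show that $\|f - g\|_2^2 \le \epsilon$. And indeed,

\begin{align*}
\|f - g\|_2^2 &= \sum_{S \subseteq [n]} (\hat{f}(S) - \hat{g}(S))^2 && \text{(Parseval)} \\
&= \sum_{S \in \mathcal{F}} (\hat{f}(S) - \tilde{f}(S))^2 + \sum_{S \notin \mathcal{F}} \hat{f}(S)^2 \\
&\le \sum_{S \in \mathcal{F}} \Bigl( \frac{\sqrt{\epsilon}}{2\sqrt{|\mathcal{F}|}} \Bigr)^2 + \epsilon/2 && \text{(estimates, concentration assumption)} \\
&= \epsilon/4 + \epsilon/2 \le \epsilon,
\end{align*}

as desired.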


As we described, Theorem 3.29 reduces the algorithmic task of learning $f$ to the algorithmic task of identifying a collection $\mathcal{F}$ on which $f$’s Fourier spectrum is concentrated. In Section 3.5 we will describe the Goldreich–Levin algorithm, a sophisticated way to find such an $\mathcal{F}$ assuming query access to $f$. For now, though, we observe that for several interesting concept classes we don’t need to do any algorithmic searching for $\mathcal{F}$; we can just take $\mathcal{F}$ to be all sets of small cardinality. This works whenever all functions in $\mathcal{C}$ have low-degree spectral concentration.


The “Low-Degree Algorithm”. Let $k \ge 1$ and let $\mathcal{C}$ be a concept class for which every function $f : \{-1, 1\}^n \to \{-1, 1\}$ in $\mathcal{C}$ is $\epsilon/2$-concentrated up to degree $k$. Then $\mathcal{C}$ can be learned from random examples only with error $\epsilon$ in time $\mathrm{poly}(n^k, 1/\epsilon)$.


Proof. Apply Theorem 3.29 with $\mathcal{F} = \{S \subseteq [n] : |S| \le k\}$. We have $|\mathcal{F}| = \sum_{j=0}^k \binom{n}{j} \le O(n^k)$.
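A minimal sketch of the Low-Degree Algorithm follows. As a simplification (an assumption, not the exact procedure in the proof of Theorem 3.29), all coefficients are estimated from one shared batch of examples rather than by separate calls to the estimator of Proposition 3.30. The target, a $\mathrm{Maj}_3$ junta on $n = 5$ bits, and the sample size of 2000 are arbitrary illustrative choices.

```python
import random
from itertools import combinations, product

def low_degree_learn(examples, n, k):
    """Estimate f_hat(S) for all |S| <= k from (x, f(x)) pairs; return h = sgn(sum of estimates * chi_S)."""
    estimates = {}
    for r in range(k + 1):
        for S in combinations(range(n), r):
            total = 0
            for x, fx in examples:
                chi = 1
                for i in S:
                    chi *= x[i]
                total += fx * chi
            estimates[S] = total / len(examples)

    def hypothesis(x):
        g = 0.0
        for S, c in estimates.items():
            chi = 1
            for i in S:
                chi *= x[i]
            g += c * chi
        return 1 if g >= 0 else -1          # sgn(0) chosen to be +1

    return hypothesis

def target(x):
    # Majority of the first three of five coordinates.
    return 1 if x[0] + x[1] + x[2] > 0 else -1

n = 5
rng = random.Random(1)                       # seeded only to make the demo reproducible
examples = []
for _ in range(2000):
    x = tuple(rng.choice((-1, 1)) for _ in range(n))
    examples.append((x, target(x)))

h = low_degree_learn(examples, n, k=1)
error = sum(h(x) != target(x) for x in product([-1, 1], repeat=n)) / 2 ** n
print(error)
```

With $k = 1$ the target is only $1/4$-concentrated up to degree $k$, so the theorem guarantees error at most $1/2$; in this particular case the sign of the degree-1 part happens to equal the target, so the observed error is typically much smaller.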


The Low-Degree Algorithm reduces the algorithmic problem of learning $\mathcal{C}$ from random examples to the analytic task of showing low-degree spectral concentration for the functions in $\mathcal{C}$. Using the results of Section 3.1 we can quickly obtain some learning-theoretic results. For example:


Corollary 3.32. For $t \ge 1$, let $\mathcal{C} = \{f : \{-1, 1\}^n \to \{-1, 1\} \mid \mathbf{I}[f] \le t\}$. Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in time $n^{O(t/\epsilon)}$.


Proof. Use the Low-Degree Algorithm with $k = 2t/\epsilon$; the result follows from Proposition 3.2.


Corollary 3.33. Let $\mathcal{C} = \{f : \{-1, 1\}^n \to \{-1, 1\} \mid f \text{ is monotone}\}$. Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in time $n^{O(\sqrt{n}/\epsilon)}$.


Proof. Follows from the previous corollary and Theorem 2.33.


You might be concerned that a running time such as $n^{O(\sqrt{n})}$ does not seem very efficient. Still, it’s much better than the trivial running time of $\widetilde{O}(2^n)$. Further, as we will see in the next section, learning algorithms are sometimes used in attacks on cryptographic schemes, and in this context even subexponential-time algorithms are considered dangerous.

Continuing with applications of the Low-Degree Algorithm: 


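Corollary 3.34. For $\delta \in (0, 1/2]$, let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathbf{NS}_\delta[f] \le \epsilon/6\}$.
Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in
time $\mathrm{poly}(n^{1/\delta}, 1/\epsilon)$.


Proof. Follows from Proposition 3.3.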


Corollary 3.35. Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{DT}_{\mathrm{size}}(f) \le s\}$. Then $\mathcal{C}$
is learnable from random examples with error $\epsilon$ in time $n^{O(\log(s/\epsilon))}$.


Proof. Follows from Proposition 3.17.


With a slight extra twist one can also exactly learn the class of degree-$k$
functions in time $\mathrm{poly}(n^k)$; see Exercise 3.36:


Theorem 3.36. Let $k \ge 1$ and let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \deg(f) \le k\}$
(e.g., $\mathcal{C}$ contains all depth-$k$ decision trees). Then $\mathcal{C}$ is learnable from random
examples with error 0 in time $n^k \cdot \mathrm{poly}(n, 2^k)$.


3.5. Highlight: The Goldreich-Levin Algorithm 


We close this chapter by briefly describing a topic which is in some sense the 
“opposite” of learning theory: cryptography. At the highest level, cryptography 
is concerned with constructing functions which are computationally easy to 
compute but computationally difficult to invert. Intuitively, think about the task 
of encrypting secret messages: You would like a scheme where it’s easy to 
take any message x and produce an encrypted version e(x), but where it’s 
hard for an adversary to compute x given e(x). Indeed, even with examples 
$e(x^{(1)}), \dots, e(x^{(m)})$ of several encryptions, it should be hard for an adversary
to learn anything about the encrypted messages, or to predict (“forge”) the 
encryption of future messages. 

A basic task in cryptography is building stronger cryptographic functions 
from weaker ones. Often the first example in “Cryptography 101” is the 
Goldreich-Levin Theorem, which is used to build a “pseudorandom genera- 
tor” from a “one-way permutation”. We sketch the meaning of these terms 
and the analysis of the construction in Exercise 3.45; for now, suffice it to say 
that the key to the analysis of Goldreich and Levin’s construction is a learning
algorithm. Specifically, the Goldreich–Levin learning algorithm solves the
following problem: Given query access to a target function $f : \mathbb{F}_2^n \to \mathbb{F}_2$, find
all of the linear functions (in the sense of Chapter 1.6) with which $f$ is at
least slightly correlated. Equivalently, find all of the noticeably large Fourier
coefficients of $f$.


Goldreich–Levin Theorem. Given query access to a target $f : \{-1,1\}^n \to \{-1,1\}$
as well as input $0 < \tau \le 1$, there is a $\mathrm{poly}(n, 1/\tau)$-time algorithm that
with high probability outputs a list $L = \{U_1, \dots, U_\ell\}$ of subsets of $[n]$ such
that:


• $|\hat{f}(U)| \ge \tau \implies U \in L$;
• $U \in L \implies |\hat{f}(U)| \ge \tau/2$.


(By Parseval’s Theorem, the second guarantee implies that $|L| \le 4/\tau^2$.)

Although the Goldreich–Levin Theorem was originally developed for
cryptography, it was soon put to use for learning theory. Recall that the
“meta-algorithm” of Theorem 3.29 reduces learning an unknown target
$f : \{-1,1\}^n \to \{-1,1\}$ to identifying a collection $\mathcal{F}$ of sets on which $f$’s
Fourier spectrum is $\epsilon/2$-concentrated. Using the Goldreich–Levin Algorithm,
a learner with query access to $f$ can “collect up” its largest Fourier coefficients
until only $\epsilon/2$ Fourier weight remains unfound. This strategy straightforwardly
yields the following result (see Exercise 3.39):


Theorem 3.37. Let $\mathcal{C}$ be a concept class such that every $f : \{-1,1\}^n \to \{-1,1\}$
in $\mathcal{C}$ has its Fourier spectrum $\epsilon/4$-concentrated on a collection of
at most $M$ sets. Then $\mathcal{C}$ can be learned using queries with error $\epsilon$ in time
$\mathrm{poly}(M, n, 1/\epsilon)$.


The algorithm of Theorem 3.37 is often called the Kushilevitz–Mansour
Algorithm. Much like the Low-Degree Algorithm, it reduces the computational
problem of learning $\mathcal{C}$ (using queries) to the analytic problem of proving
that the functions in $\mathcal{C}$ have concentrated Fourier spectra. The advantage of
the Kushilevitz–Mansour Algorithm is that it works so long as the Fourier
spectrum of $f$ is concentrated on some small collection of sets; the
Low-Degree Algorithm requires that the concentration specifically be on the
low-degree characters. The disadvantage of the Kushilevitz–Mansour Algorithm
is that it requires query access to $f$, rather than just random examples. An
example concept class for which the Kushilevitz–Mansour Algorithm works
well is the set of all $f$ for which $\hat{\lVert} f \hat{\rVert}_1$ is not too large:


Theorem 3.38. Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \hat{\lVert} f \hat{\rVert}_1 \le s\}$ (e.g., $\mathcal{C}$
contains any $f$ computable by a decision tree of size at most $s$). Then $\mathcal{C}$ is learnable
from queries with error $\epsilon$ in time $\mathrm{poly}(n, s, 1/\epsilon)$.


This is proved in Exercise 3.38. 

Let’s now return to the Goldreich–Levin Algorithm itself, which seeks the
Fourier coefficients $\hat{f}(U)$ with magnitude at least $\tau$. Given any candidate
$U \subseteq [n]$, Proposition 3.30 lets us easily distinguish whether the associated
coefficient is large, $|\hat{f}(U)| \ge \tau$, or small, $|\hat{f}(U)| \le \tau/2$. The trouble is that
there are $2^n$ potential candidates. The Goldreich–Levin Algorithm overcomes this
difficulty using a divide-and-conquer strategy that measures the Fourier weight
of $f$ on various collections of sets. Let’s make a definition:


Definition 3.39. Let $f : \{-1,1\}^n \to \mathbb{R}$ and $S \subseteq J \subseteq [n]$. We write

$\mathbf{W}^{S \mid \overline{J}}[f] = \sum_{T \subseteq \overline{J}} \hat{f}(S \cup T)^2$

for the Fourier weight of $f$ on sets whose restriction to $J$ is $S$.


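The crucial tool for the Goldreich–Levin Algorithm is Corollary 3.22, which
says that

$\mathbf{W}^{S \mid \overline{J}}[f] = \mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}\bigl[\widehat{f_{J \mid z}}(S)^2\bigr]$. (3.5)

This identity lets a learning algorithm with query access to $f$ efficiently
estimate any $\mathbf{W}^{S \mid \overline{J}}[f]$ of its choosing. Intuitively, query access to $f$ allows query
access to $f_{J \mid z}$ for any $z \in \{-1,1\}^{\overline{J}}$; with this one can estimate any $\widehat{f_{J \mid z}}(S)$ and
hence (3.5). More precisely: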


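Proposition 3.40. For any $S \subseteq J \subseteq [n]$ an algorithm with query access to
$f : \{-1,1\}^n \to \{-1,1\}$ can compute an estimate of $\mathbf{W}^{S \mid \overline{J}}[f]$ that is accurate
to within $\pm\epsilon$ (except with probability at most $\delta$) in time $\mathrm{poly}(n, 1/\epsilon) \cdot \log(1/\delta)$.


Proof. From (3.5),

$\mathbf{W}^{S \mid \overline{J}}[f] = \mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}\bigl[\widehat{f_{J \mid z}}(S)^2\bigr] = \mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}\Bigl[\mathbf{E}_{y \sim \{-1,1\}^{J}}[f(y, z)\chi_S(y)]^2\Bigr]$

$= \mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}\ \mathbf{E}_{y, y' \sim \{-1,1\}^{J}}\bigl[f(y, z)\chi_S(y) \cdot f(y', z)\chi_S(y')\bigr]$,

where $y$ and $y'$ are independent. As in Proposition 3.30, $f(y, z)\chi_S(y) \cdot f(y', z)\chi_S(y')$
is a $\pm 1$-valued random variable that the algorithm can sample
from using queries to $f$. A Chernoff bound implies that $O(\log(1/\delta)/\epsilon^2)$
samples are sufficient to estimate its mean with accuracy $\pm\epsilon$ and confidence
$1 - \delta$.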
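The sampling argument in this proof can be sketched in Python as follows. This is a hedged illustration, not the book's code: the function name `estimate_weight` and its signature are hypothetical, coordinates are 0-indexed, and the sample count comes from Hoeffding's inequality for $[-1,1]$-valued variables, which is one concrete way to instantiate the "Chernoff bound" step.

```python
import math
import random


def chi(S, x):
    """Character chi_S(x): the product of x[i] over i in S."""
    prod = 1
    for i in S:
        prod *= x[i]
    return prod


def estimate_weight(f, n, J, S, eps, delta, rng):
    """Estimate the sum of fhat(S u T)^2 over all T inside the complement of J,
    using only queries to f. Requires S to be a subset of J."""
    J = sorted(J)
    J_bar = [i for i in range(n) if i not in J]
    # Hoeffding: m >= 2 ln(2/delta) / eps^2 samples give accuracy eps
    # except with probability at most delta.
    m = math.ceil(2 * math.log(2 / delta) / eps ** 2)
    total = 0
    for _ in range(m):
        x, x_prime = [0] * n, [0] * n
        for i in J_bar:                     # shared restriction z
            x[i] = x_prime[i] = rng.choice((-1, 1))
        for i in J:                         # independent y and y'
            x[i] = rng.choice((-1, 1))
            x_prime[i] = rng.choice((-1, 1))
        total += f(x) * chi(S, x) * f(x_prime) * chi(S, x_prime)
    return total / m
```

As a sanity check, take the parity $f(x) = x_0 x_2$ on $n = 4$ bits with $J = \{0, 1\}$. For $S = \{0\}$ every sample equals $x_2 \cdot x_2 = 1$, matching the true weight $\hat{f}(\{0,2\})^2 = 1$; for $S = \{1\}$ the true weight is 0 and the estimate should be close to 0.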


We’re now ready to prove the Goldreich—Levin Theorem. 


Proof of the Goldreich–Levin Theorem. We begin with an overview of how the
algorithm works. Initially, all $2^n$ sets $U$ are (implicitly) put in a single “bucket”.
The algorithm then repeats the following loop:


• Select any bucket $\mathcal{B}$ containing $2^m$ sets, $m \ge 1$.
• Split it into two buckets $\mathcal{B}_1$, $\mathcal{B}_2$ of $2^{m-1}$ sets each.
• “Weigh” each $\mathcal{B}_i$, $i = 1, 2$; i.e., estimate $\sum_{U \in \mathcal{B}_i} \hat{f}(U)^2$.
• Discard $\mathcal{B}_1$ or $\mathcal{B}_2$ if its weight estimate is at most $\tau^2/2$.


The algorithm stops once all buckets contain just 1 set; it then outputs the list
of these sets.

We now fill in the details. First we argue the correctness of the algorithm,
assuming all weight estimates are accurate (this assumption is removed later).
On one hand, any set $U$ with $|\hat{f}(U)| \ge \tau$ will never be discarded, since it
always contributes weight at least $\tau^2 \ge \tau^2/2$ to the bucket it’s in. On the
other hand, no set $U$ with $|\hat{f}(U)| < \tau/2$ can end up in a singleton bucket
because such a bucket, when created, would have weight only $\tau^2/4 < \tau^2/2$
and thus be discarded. Notice that this correctness proof does not rely on the
weight estimates being exact; it suffices for them to be accurate to within
$\pm\tau^2/4$.

The next detail concerns running time. Note that any “active” (undiscarded)
bucket has weight at least $\tau^2/4$, even assuming the weight estimates are only
accurate to within $\pm\tau^2/4$. Therefore Parseval tells us there can only ever be at
most $4/\tau^2$ active buckets. Since a bucket can be split only $n$ times, it follows
that the algorithm repeats its main loop at most $4n/\tau^2$ times. Thus as long as
the buckets can be maintained and accurately weighed in $\mathrm{poly}(n, 1/\tau)$ time,
the overall running time will be $\mathrm{poly}(n, 1/\tau)$ as claimed.

Finally, we describe the bucketing system. The buckets are indexed (and
thus maintained implicitly) by an integer $0 \le k \le n$ and a subset $S \subseteq [k]$. The
bucket $\mathcal{B}_{k,S}$ is defined by

$\mathcal{B}_{k,S} = \{S \cup T : T \subseteq \{k+1, k+2, \dots, n\}\}$.

Note that $|\mathcal{B}_{k,S}| = 2^{n-k}$. The initial bucket is $\mathcal{B}_{0,\emptyset}$. The algorithm always splits
a bucket $\mathcal{B}_{k,S}$ into the two buckets $\mathcal{B}_{k+1,S}$ and $\mathcal{B}_{k+1,S \cup \{k+1\}}$. The final singleton
buckets are of the form $\mathcal{B}_{n,S} = \{S\}$. Finally, the weight of bucket $\mathcal{B}_{k,S}$ is
precisely $\mathbf{W}^{S \mid \{k+1, \dots, n\}}[f]$. Thus it can be estimated to accuracy $\pm\tau^2/4$ with
confidence $1 - \delta$ in time $\mathrm{poly}(n, 1/\tau) \cdot \log(1/\delta)$ using Proposition 3.40. Since
the main loop is executed at most $4n/\tau^2$ times, the algorithm overall needs
to make at most $8n/\tau^2$ weighings; by setting $\delta = \tau^2/(80n)$ we ensure that
all weighings are accurate with high probability (at least 9/10). The overall
running time is therefore indeed $\mathrm{poly}(n, 1/\tau)$.
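The bucketing scheme just described can be sketched in Python as below. This is a minimal illustration under assumptions, not a reference implementation: `goldreich_levin` and `weigh` are hypothetical names, coordinates are 0-indexed (so bucket $(k, S)$ splits into $(k+1, S)$ and $(k+1, S \cup \{k\})$), and the per-weighing sample count is taken from Hoeffding's inequality with a union bound over at most $8n/\tau^2$ weighings.

```python
import math
import random


def goldreich_levin(f, n, tau, rng, fail_prob=0.1):
    """Return a list of frozensets U. With probability >= 1 - fail_prob it
    contains every U with |fhat(U)| >= tau and only U with |fhat(U)| >= tau/2."""
    eps = tau ** 2 / 4                          # required accuracy per weighing
    delta = fail_prob * tau ** 2 / (8 * n)      # union bound over weighings
    m = math.ceil(2 * math.log(2 / delta) / eps ** 2)

    def weigh(k, S):
        """Estimate the Fourier weight of the bucket {S u T : T within {k..n-1}}."""
        total = 0
        for _ in range(m):
            z = [rng.choice((-1, 1)) for _ in range(n - k)]   # shared suffix
            y = [rng.choice((-1, 1)) for _ in range(k)]       # independent prefixes
            y_prime = [rng.choice((-1, 1)) for _ in range(k)]
            a, b = f(y + z), f(y_prime + z)
            for i in S:                                       # multiply by chi_S
                a *= y[i]
                b *= y_prime[i]
            total += a * b
        return total / m

    active = [(0, frozenset())]     # the single initial bucket of all 2^n sets
    output = []
    while active:
        k, S = active.pop()
        if k == n:                  # singleton bucket {S}
            output.append(S)
            continue
        for child in ((k + 1, S), (k + 1, S | {k})):
            if weigh(*child) > tau ** 2 / 2:    # otherwise discard the bucket
                active.append(child)
    return output
```

For example, calling it with the majority of the first three of five bits and $\tau = 1/2$ should return the four sets $\{0\}, \{1\}, \{2\}, \{0,1,2\}$, whose coefficients all have magnitude $1/2$.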
3.1 Let $M : \mathbb{F}_2^n \to \mathbb{F}_2^n$ be an invertible linear transformation. Given $f : \mathbb{F}_2^n \to \mathbb{R}$,
let $f \circ M : \mathbb{F}_2^n \to \mathbb{R}$ be defined by $f \circ M(x) = f(Mx)$. Show that
$\widehat{f \circ M}(\gamma) = \hat{f}(M^{-\top}\gamma)$. What if $M$ is an invertible affine transformation?
What if $M$ is not invertible?

3.2 Show that $\frac{2}{1 - e^{-2}}$ is the smallest constant (not depending on $\delta$ or $n$) that can
be taken in Proposition 3.3.

3.3 Generalize Proposition 3.3 by showing that any $f : \{-1,1\}^n \to \mathbb{R}$ is
$\epsilon$-concentrated on degree up to $1/\delta$ for $\epsilon = (\mathbf{E}[f^2] - \mathbf{Stab}_{1-\delta}[f])/(1 - 1/e)$.

3.4 Prove Lemma 3.5 by induction on $n$. (Hint: If one of the subfunctions
$f(x_1, \dots, x_{n-1}, \pm 1)$ is identically 0, show that the other has degree at
most $k - 1$.)

3.5 Verify for all $p \in [1, \infty]$ that $\hat{\lVert} \cdot \hat{\rVert}_p$ is a norm on the vector space of
functions $f : \mathbb{F}_2^n \to \mathbb{R}$.

3.6 Show that $\hat{\lVert} fg \hat{\rVert}_1 \le \hat{\lVert} f \hat{\rVert}_1 \hat{\lVert} g \hat{\rVert}_1$ for all $f, g : \mathbb{F}_2^n \to \mathbb{R}$.

3.7 Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $J \subseteq [n]$, $z \in \{-1,1\}^{\overline{J}}$.
(a) Show that restriction reduces spectral 1-norm: $\hat{\lVert} f_{J \mid z} \hat{\rVert}_1 \le \hat{\lVert} f \hat{\rVert}_1$.
(b) Show that it also reduces Fourier sparsity: $\mathrm{sparsity}(\widehat{f_{J \mid z}}) \le \mathrm{sparsity}(\hat{f})$.

3.8 Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $0 < p \le q \le \infty$. Show that $\hat{\lVert} f \hat{\rVert}_p \ge \hat{\lVert} f \hat{\rVert}_q$.
(Cf. Exercise 1.13.)

3.9 Let $f : \{-1,1\}^n \to \mathbb{R}$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \le \lVert f \rVert_1$ and $\lVert f \rVert_\infty \le \hat{\lVert} f \hat{\rVert}_1$.
(These are easy special cases of the Hausdorff–Young Inequality.)

3.10 Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is monotone. Show that $|\hat{f}(S)| \le \hat{f}(i)$
whenever $i \in S \subseteq [n]$. Deduce that $\hat{\lVert} f \hat{\rVert}_\infty = \max_S\{|\hat{f}(S)|\}$ is achieved
by an $S$ of cardinality 0 or 1. (Hint: Apply the previous exercise to $f$’s
derivatives.)

3.11 Prove Proposition 3.12.

3.12 Verify Parseval’s Theorem for the Fourier expansion of subspaces given
in Proposition 3.11.

3.13 Let $f : \mathbb{F}_2^n \to \{0, 1\}$ be the indicator of $A \subseteq \mathbb{F}_2^n$. We know that $\hat{\lVert} f \hat{\rVert}_1 = 1$
if $A$ is an affine subspace. So assume that $A$ is not an affine subspace.
(a) Show that there exists an affine subspace $B$ of dimension 2 on which $f$
takes the value 1 exactly 3 times.
(b) Let $b$ be the point in $B$ where $f$ is 0 and let $\psi = \varphi_B - (1/2)\varphi_b$. Show
that $\hat{\lVert} \psi \hat{\rVert}_\infty = 1/2$.
(c) Show that $\langle \psi, f \rangle = 3/4$ and deduce $\hat{\lVert} f \hat{\rVert}_1 \ge 3/2$.

3.14 Suppose $f : \{-1,1\}^n \to \mathbb{R}$ satisfies $\mathbf{E}[f^2] \le 1$. Show that $\hat{\lVert} f \hat{\rVert}_1 \le 2^{n/2}$,
and show that for any even $n$ the upper bound can be achieved by a
function $f : \{-1,1\}^n \to \{-1,1\}$.


3.15 Given $f : \mathbb{F}_2^n \to \mathbb{R}$, define its (fractional) sparsity to be $\mathrm{sparsity}(f) =
|\mathrm{supp}(f)|/2^n = \Pr_{x \in \mathbb{F}_2^n}[f(x) \ne 0]$. In this exercise you will prove the
uncertainty principle: If $f \ne 0$, then $\mathrm{sparsity}(f) \cdot \mathrm{sparsity}(\hat{f}) \ge 1$.
(a) Show that we may assume $\lVert f \rVert_1 = 1$.
(b) Suppose $\mathcal{F} = \{\gamma : \hat{f}(\gamma) \ne 0\}$. Show that $\hat{\lVert} f \hat{\rVert}_2^2 \le |\mathcal{F}|$.
(c) Suppose $\mathcal{G} = \{x : f(x) \ne 0\}$. Show that $\hat{\lVert} f \hat{\rVert}_2^2 \ge 2^n/|\mathcal{G}|$, and deduce
the uncertainty principle.
(d) Identify all cases of equality.

3.16 Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $\epsilon > 0$. Show that $f$ is $\epsilon$-concentrated on a
collection $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| \le \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$.

3.17 Suppose the Fourier spectrum of $f : \{-1,1\}^n \to \mathbb{R}$ is $\epsilon_1$-concentrated
on $\mathcal{F}$ and that $g : \{-1,1\}^n \to \mathbb{R}$ satisfies $\lVert f - g \rVert_2^2 \le \epsilon_2$. Show that the
Fourier spectrum of $g$ is $2(\epsilon_1 + \epsilon_2)$-concentrated on $\mathcal{F}$.

3.18 Show that every function $f : \mathbb{F}_2^n \to \mathbb{R}$ is computed by a decision tree with
depth at most $n$ and size at most $2^n$.

3.19 Let $f : \mathbb{F}_2^n \to \mathbb{R}$ be computable by a decision tree of size $s$ and depth $k$.
Show that $-f$ and the Boolean dual $f^\dagger$ are also computable by decision
trees of size $s$ and depth $k$.

3.20 For each function in Exercise 1.1 with 4 or fewer inputs, give a decision
tree computing it. Try primarily to use the least possible depth, and
secondarily to use the least possible size.

3.21 Prove Proposition 3.16.

3.22 Let $f : \mathbb{F}_2^n \to \{-1, 1\}$ be computed by a decision tree $T$ of size $s$ and
let $\epsilon \in (0, 1]$. Suppose each path in $T$ is truncated (if necessary) so that
its length does not exceed $\log(s/\epsilon)$; new leaves with labels $-1$ and $1$
may be created in an arbitrary way as necessary. Show that the resulting
decision tree $T'$ computes a function that is $\epsilon$-close to $f$. Deduce
Proposition 3.17.

3.23 A decision list is a decision tree in which every internal node has an
outgoing edge to at least one leaf. Show that any function computable by
a decision list is a linear threshold function.

3.24 A read-once decision tree is one in which every internal node queries
a distinct variable. Bearing this in mind, show that the bound $k2^{k-1}$ in
Theorem 3.4 cannot be reduced below $2^k - 1$.

3.25 Suppose that $f$ is computed by a read-once decision tree in which every
root-to-leaf path has length $k$ and every internal node at the deepest level
has one child (leaf) labeled $-1$ and one child labeled $1$. Compute the
influence of each coordinate on $f$, and compute $\mathbf{I}[f]$.


3.26 The following are generalizations of decision trees:

Subcube partition: This is defined by a collection $C_1, \dots, C_s$ of subcubes
that form a partition of $\mathbb{F}_2^n$, along with values $b_1, \dots, b_s \in \mathbb{R}$. It
computes the function $f : \mathbb{F}_2^n \to \mathbb{R}$ which has value $b_i$ on all inputs in
$C_i$. The subcube partition’s size is $s$ and its “codimension” $k$ (analogous
to depth) is the maximum codimension of the cubes $C_i$.

Parity decision tree: This is similar to a decision tree except that the
internal nodes are labeled by vectors $\gamma \in \mathbb{F}_2^n$. At such a node the
computation path on input $x$ follows the edge labeled $\gamma \cdot x$. We insist that for
each root-to-leaf path, the vectors appearing in its internal nodes are
linearly independent. Size $s$ and depth $k$ are defined as with normal decision
trees.

Affine subspace partition: This is similar to a subcube partition except
that the subcubes $C_i$ may be arbitrary affine subspaces.

(a) Show that subcube partition size/codimension and parity decision
tree size/depth generalize normal decision tree size/depth, and are
generalized by affine subspace partition size/codimension.
(b) Show that Proposition 3.16 holds also for the generalizations, except
that the statement about degree need not hold for parity decision trees
and affine subspace partitions.
(c) Show that the class of functions with affine subspace partition size at
most $s$ is learnable from queries with error $\epsilon$ in time $\mathrm{poly}(n, s, 1/\epsilon)$.

3.27 Define $\mathrm{Equ}_3 : \{-1,1\}^3 \to \{-1,1\}$ by $\mathrm{Equ}_3(x) = -1$ if and only if
$x_1 = x_2 = x_3$.
(a) Show that $\deg(\mathrm{Equ}_3) = 2$.
(b) Show that $\mathrm{DT}(\mathrm{Equ}_3) = 3$.
(c) Show that $\mathrm{Equ}_3$ is computable by a parity decision tree of codimension 2.
(d) For $d \in \mathbb{N}$, define $f : \{-1,1\}^{3^d} \to \{-1,1\}$ by $f = \mathrm{Equ}_3^{\otimes d}$ (using
the notation from Definition 2.6). Show that $\deg(f) = 2^d$ but
$\mathrm{DT}(f) = 3^d$.

3.28 Let $f : \{-1,1\}^n \to \mathbb{R}$ and $J \subseteq [n]$. Define $f^{\subseteq J} : \{-1,1\}^n \to \mathbb{R}$ by
$f^{\subseteq J}(x) = \mathbf{E}_{y \sim \{-1,1\}^{\overline{J}}}[f(x_J, y)]$, where $x_J \in \{-1,1\}^J$ is the projection of
$x$ to coordinates $J$. Verify the Fourier expansion

$f^{\subseteq J} = \sum_{S \subseteq J} \hat{f}(S)\,\chi_S$.

3.29 Let $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ be a probability density function corresponding to
probability distribution $\varphi$ on $\mathbb{F}_2^n$. Let $J \subseteq [n]$.
(a) Consider the marginal probability distribution of $\varphi$ on coordinates $J$.
What is its probability density function (a function $\mathbb{F}_2^J \to \mathbb{R}^{\ge 0}$) in
terms of $\varphi$?
(b) Consider the probability distribution of $\varphi$ conditioned on a substring
$z \in \mathbb{F}_2^{\overline{J}}$. Assuming it’s well defined, what is its probability density
function in terms of $\varphi$?

3.30 Suppose $f : \{-1,1\}^n \to \mathbb{R}$ is computable by a decision tree that has a
leaf at depth $k$ labeled $b$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \ge |b|/2^k$. (Hint: You may find
Exercise 3.28 helpful.)

3.31 Prove Fact 3.25 by using Theorem 1.27 and Exercise 1.1(d).


3.32 (a) Suppose $f : \mathbb{F}_2^n \to \mathbb{R}$ has $\mathrm{sparsity}(\hat{f}) < 2^n$. Show that for any
$\gamma \in \mathrm{supp}(\hat{f})$ there exists nonzero $\beta \in \widehat{\mathbb{F}_2^n}$ such that $f_{\beta^\perp}$ has $\hat{f}(\gamma)$
as a Fourier coefficient.
(b) Prove by induction on $n$ that if $f : \mathbb{F}_2^n \to \{-1, 1\}$ has $\mathrm{sparsity}(\hat{f}) =
s > 1$ then $\hat{f}$ is $2^{1 - \lfloor \log s \rfloor}$-granular. (Hint: Distinguish the cases $s = 2^n$
and $s < 2^n$. In the latter case use part (a).)
(c) Prove that there are no functions $f : \{-1,1\}^n \to \{-1,1\}$ with
$\mathrm{sparsity}(\hat{f}) \in \{2, 3, 5, 6, 7, 9\}$.

3.33 Show that one can learn any target $f : \{-1,1\}^n \to \{-1,1\}$ with error 0
from random examples only in time $\widetilde{O}(2^n)$.

3.34 Improve Proposition 3.31 as follows. Suppose $f : \{-1,1\}^n \to \{-1,1\}$
and $g : \{-1,1\}^n \to \mathbb{R}$ satisfy $\lVert f - g \rVert_1 \le \epsilon$. Pick $\theta \in [-1, 1]$ uniformly
at random and define $h : \{-1,1\}^n \to \{-1,1\}$ by $h(x) = \mathrm{sgn}(g(x) - \theta)$.
Show that $\mathbf{E}[\mathrm{dist}(f, h)] \le \epsilon/2$.

3.35 (a) For $n$ even, find a function $f : \{-1,1\}^n \to \{-1,1\}$ that is not
1/2-concentrated on any $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| < 2^{n-1}$. (Hint: Exercise 1.1.)
(b) Let $f : \{-1,1\}^n \to \{-1,1\}$ be a random function as in Exercise 1.7.
Show that with probability at least 1/2, $f$ is not 1/4-concentrated on
degree up to $\lfloor n/2 \rfloor$.

3.36 Prove Theorem 3.36. (Hint: In light of Exercise 1.11 you may round off
certain estimates with confidence.)

3.37 Show that each of the following classes $\mathcal{C}$ (ordered by inclusion) can be
learned exactly (i.e., with error 0) using queries in time $\mathrm{poly}(n, 2^k)$:
(a) $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid f \text{ is a } k\text{-junta}\}$. (Hint: Estimate
influences.)
(b) $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{DT}(f) \le k\}$.
(c) $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{sparsity}(\hat{f}) \le 2^{O(k)}\}$. (Hint:
Exercise 3.32.)

3.38 Prove Theorem 3.38. (Hint: Exercise 3.16.)

3.39 Deduce Theorem 3.37 from the Goldreich–Levin Algorithm.

3.40 Suppose $A$ learns $\mathcal{C}$ from random examples with error $\epsilon/2$ in time $T$,
with probability at least 9/10.
(a) After producing hypothesis $h$ on target $f : \{-1,1\}^n \to \{-1,1\}$,
show that $A$ can “check” whether $h$ is a good hypothesis in time
$\mathrm{poly}(n, T, 1/\epsilon) \cdot \log(1/\delta)$. Specifically, except with probability at
most $\delta$, $A$ should output ‘YES’ if $\mathrm{dist}(f, h) \le \epsilon/2$ and ‘NO’ if
$\mathrm{dist}(f, h) > \epsilon$. (Hint: Time $\mathrm{poly}(T)$ may be required for $A$ to evaluate
$h(x)$.)
(b) Show that for any $\delta \in (0, 1/2]$, there is a learning algorithm that learns
$\mathcal{C}$ with error $\epsilon$ in time $\mathrm{poly}(n, T, 1/\epsilon) \cdot \log(1/\delta)$, with probability at
least $1 - \delta$.

3.41 (a) Our description of the Low-Degree Algorithm with degree $k$ and
error $\epsilon$ involved using a new batch of random examples to estimate
each low-degree Fourier coefficient. Show that one can instead simply
draw a single batch $\mathcal{E}$ of $\mathrm{poly}(n^k, 1/\epsilon)$ examples and use $\mathcal{E}$ to estimate
each of the low-degree coefficients.
(b) Show that when using the above form of the Low-Degree Algorithm,
the final hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ is of the form

$h(y) = \mathrm{sgn}\Bigl(\sum_{(x, f(x)) \in \mathcal{E}} w(\Delta(y, x)) \cdot f(x)\Bigr)$

for some function $w : \{0, 1, \dots, n\} \to \mathbb{R}$. In other words, the hypothesis
on a given $y$ is equal to a weighted vote over all examples seen,
where an example’s weight depends only on its Hamming distance
to $y$. Simplify your expression for $w$ as much as you can.


3.42 Extend the Goldreich–Levin Algorithm so that it works also for functions
$f : \{-1,1\}^n \to [-1, 1]$. (The learning model for targets $f : \{-1,1\}^n \to
[-1, 1]$ assumes that $f(x)$ is always a rational number expressible by
$\mathrm{poly}(n)$ bits.)

3.43 (a) Assume $\gamma, \gamma' \in \mathbb{F}_2^n$ are distinct. Show that $\Pr_x[\gamma \cdot x = \gamma' \cdot x] = 1/2$.
(b) Fix $\gamma \in \mathbb{F}_2^n$ and suppose $x^{(1)}, \dots, x^{(m)} \sim \mathbb{F}_2^n$ are drawn uniformly and
independently. Show that if $m = Cn$ for $C$ a sufficiently large constant
then with high probability, the only $\gamma' \in \mathbb{F}_2^n$ satisfying $\gamma' \cdot x^{(i)} =
\gamma \cdot x^{(i)}$ for all $i \in [m]$ is $\gamma' = \gamma$.
(c) Essentially improve on Exercise 1.27 by showing that the concept
class of all linear functions $\mathbb{F}_2^n \to \mathbb{F}_2$ can be learned from random
examples only, with error 0, in time $\mathrm{poly}(n)$. (Remark: If $\omega \in \mathbb{R}$ is
such that $n \times n$ matrix multiplication can be done in $O(n^\omega)$ time, then
the learning algorithm also requires only $O(n^\omega)$ time.)

3.44 Let $\tau \ge 1/2 + \epsilon$ for some constant $\epsilon > 0$. Give an algorithm simpler
than Goldreich and Levin’s that solves the following problem with
high probability: Given query access to $f : \{-1,1\}^n \to \{-1,1\}$, in time
$\mathrm{poly}(n, 1/\epsilon)$ find the unique $U \subseteq [n]$ such that $|\hat{f}(U)| \ge \tau$, assuming it
exists. (Hint: Use Proposition 1.31 and Exercise 1.27.)

3.45 Informally: a “one-way permutation” is a bijective function $f : \mathbb{F}_2^n \to
\mathbb{F}_2^n$ that is easy to compute on all inputs but hard to invert on more
than a negligible fraction of inputs; a “pseudorandom generator” is a
function $g : \mathbb{F}_2^k \to \mathbb{F}_2^m$ for $m > k$ whose output on a random input “looks
unpredictable” to any efficient algorithm. Goldreich and Levin proposed
the following construction of the latter from the former: for $k = 2n$,
$m = 2n + 1$, define

$g(r, s) = (r, f(s), r \cdot s)$,

where $r, s \in \mathbb{F}_2^n$. When $g$’s input $(r, s)$ is uniformly random, then so are the
first $2n$ bits of its output (using the fact that $f$ is a bijection). The key to
the analysis is showing that the final bit, $r \cdot s$, is highly unpredictable to
efficient algorithms even given the first $2n$ bits $(r, f(s))$. This is proved
by contradiction.
(a) Suppose that an adversary has a deterministic, efficient algorithm $A$
good at predicting the bit $r \cdot s$:

$\Pr_{r, s \sim \mathbb{F}_2^n}[A(r, f(s)) = r \cdot s] \ge \tfrac{1}{2} + \gamma$.

Show there exists $B \subseteq \mathbb{F}_2^n$ with $|B|/2^n \ge \tfrac{1}{2}\gamma$ such that

$\Pr_{r \sim \mathbb{F}_2^n}[A(r, f(s)) = r \cdot s] \ge \tfrac{1}{2} + \tfrac{1}{2}\gamma$

for all $s \in B$.
(b) Switching to $\pm 1$ notation in the output, deduce $\widehat{A_{\mid f(s)}}(s) \ge \gamma$ for all
$s \in B$.
(c) Show that the adversary can efficiently compute $s$ given $f(s)$ (with
high probability) for any $s \in B$. If $\gamma$ is nonnegligible, this contradicts
the assumption that $f$ is “one-way”. (Hint: Use the Goldreich–Levin
Algorithm.)
(d) Deduce the same conclusion even if $A$ is a randomized algorithm.


Notes


The fact that the Fourier characters $\chi_\gamma : \mathbb{F}_2^n \to \{-1, 1\}$ form a group isomorphic to $\mathbb{F}_2^n$
is not a coincidence; the analogous result holds for any finite abelian group and is a
special case of the theory of Pontryagin duality in harmonic analysis. We will see further
examples of this in Chapter 8.

Regarding spectral structure, Karpovsky (Karpovsky, 1976) proposed sparsity( f) as 
a measure of complexity for the function f. Brandman’s thesis (Brandman, 1987) (see 
also (Brandman et al., 1990)) is an early work connecting decision tree and subcube 
partition complexity to Fourier analysis. The notation introduced for restrictions in 
Section 3.3 is not standard; unfortunately there is no standard notation. The uncertainty 
principle from Exercise 3.15 dates back to Matolcsi and Szücs (Matolcsi and Szücs, 
1973). The result of Exercise 3.13 is due to Green and Sanders (Green and Sanders, 
2008), with inspiration from Saeki (Saeki, 1968). The main result of Green and Sanders 
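is the sophisticated theorem that any $f : \mathbb{F}_2^n \to \{0, 1\}$ with $\hat{\lVert} f \hat{\rVert}_1 \le s$ can be expressed
as $\sum_{i=1}^{L} \pm 1_{H_i}$, where $L \le 2^{2^{\mathrm{poly}(s)}}$ and each $H_i \le \mathbb{F}_2^n$.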

Theorem 3.4 is due to Nisan and Szegedy (Nisan and Szegedy, 1994). That work 
also showed a nontrivial kind of converse to the first statement in Proposition 3.16: Any
$f : \{-1,1\}^n \to \{-1,1\}$ is computable by a decision tree of depth at most $\mathrm{poly}(\deg(f))$.
The best upper bound currently known is $\deg(f)^3$ due to Midrijanis (Midrijanis, 2004).
Nisan and Szegedy also gave the example in Exercise 3.27 showing the dependence 
cannot be linear. 

The field of computational learning theory was introduced by Valiant in 
1984 (Valiant, 1984); for a good survey with focus on learning under the uniform dis- 
tribution, see the thesis by Jackson (Jackson, 1995). Linial, Mansour, and Nisan (Linial 
et al., 1993) pioneered the Fourier approach to learning, developing the Low-Degree 
Algorithm. We present their strong results on constant-depth circuits in Chapter 4. The 
noise sensitivity approach to the Low-Degree Algorithm is from Klivans, O’Donnell, and
Servedio (Klivans et al., 2004). Corollary 3.33 is due to Bshouty and Tamon (Bshouty 
and Tamon, 1996) who also gave certain matching lower bounds. Goldreich and Levin’s 
work dates from 1989 (Goldreich and Levin, 1989). Besides its applications to cryp- 
tography and learning, it is important in coding theory and complexity as a local 
list-decoding algorithm for the Hadamard code. The Kushilevitz—Mansour algorithm is 
from their 1993 paper (Kushilevitz and Mansour, 1993); they also are responsible for 
the results of Exercise 3.37(b) and 3.38. The results of Exercise 3.32 and 3.37(c) are 
from Gopalan et al. (Gopalan et al., 2011). 


4 
DNF Formulas and Small-Depth Circuits 


In this chapter we investigate Boolean functions representable by small DNF 
formulas and constant-depth circuits; these are significant generalizations of 
decision trees. Besides being natural from a computational point of view, these 
representation classes are close to the limit of what complexity theorists can 
“understand” (e.g., prove explicit lower bounds for). One reason for this is that 
functions in these classes have strong Fourier concentration properties. 


4.1. DNF Formulas 


One of the commonest ways of representing a Boolean function $f : \{0,1\}^n \to
\{0,1\}$ is by a DNF formula:


Definition 4.1. A DNF (disjunctive normal form) formula over Boolean variables
$x_1, \dots, x_n$ is defined to be a logical OR of terms, each of which is a
logical AND of literals. A literal is either a variable $x_i$ or its logical
negation $\overline{x}_i$. We insist that no term contains both a variable and its negation. The
number of literals in a term is called its width. We often identify a DNF formula
with the Boolean function $f : \{0,1\}^n \to \{0,1\}$ it computes.


Example 4.2. Recall the function $\mathrm{Sort}_3$, defined by $\mathrm{Sort}_3(x_1, x_2, x_3) = 1$ if and
only if $x_1 \le x_2 \le x_3$ or $x_1 \ge x_2 \ge x_3$. We can represent it by a DNF formula
as follows:

$\mathrm{Sort}_3(x_1, x_2, x_3) = (x_1 \wedge x_2) \vee (\overline{x}_2 \wedge \overline{x}_3) \vee (\overline{x}_1 \wedge x_3)$.

The DNF representation says that the bits are sorted if either the first two bits
are 1, or the last two bits are 0, or the first bit is 0 and the last bit is 1.


The complexity of a DNF formula is measured by its size and width:


Definition 4.3. The size of a DNF formula is its number of terms. The width
is the maximum width of its terms. Given $f : \{-1,1\}^n \to \{-1,1\}$ we write
$\mathrm{DNF}_{\mathrm{size}}(f)$ (respectively, $\mathrm{DNF}_{\mathrm{width}}(f)$) for the least size (respectively, width)
of a DNF formula computing $f$.


The DNF formula for $\mathrm{Sort}_3$ from Example 4.2 has size 3 and width 2. Every
function $f : \{0,1\}^n \to \{0,1\}$ can be computed by a DNF of size at most $2^n$
and width at most $n$ (Exercise 4.1).
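As a quick check of Example 4.2 and Definition 4.3, the following Python sketch encodes the $\mathrm{Sort}_3$ DNF as data and verifies it by brute force. The encoding (a term is a list of `(index, value)` pairs, with 0-indexed variables) and the helper names are assumptions made for this illustration only.

```python
from itertools import product

# A literal (i, v) is satisfied when x[i] == v; a term is a list of literals.
SORT3_DNF = [
    [(0, 1), (1, 1)],   # x1 AND x2
    [(1, 0), (2, 0)],   # (NOT x2) AND (NOT x3)
    [(0, 0), (2, 1)],   # (NOT x1) AND x3
]


def eval_dnf(dnf, x):
    """Evaluate a DNF (OR of ANDs of literals) on a 0/1 tuple x."""
    return int(any(all(x[i] == v for i, v in term) for term in dnf))


def dnf_size(dnf):
    """Size: the number of terms."""
    return len(dnf)


def dnf_width(dnf):
    """Width: the maximum number of literals in a term."""
    return max(len(term) for term in dnf)


def sort3(x):
    """1 iff the three bits are in nondecreasing or nonincreasing order."""
    return int(x[0] <= x[1] <= x[2] or x[0] >= x[1] >= x[2])


# Compare the formula with the definition on all 8 inputs.
agrees = all(eval_dnf(SORT3_DNF, x) == sort3(x) for x in product((0, 1), repeat=3))
```

Here `dnf_size(SORT3_DNF)` is 3 and `dnf_width(SORT3_DNF)` is 2, in line with the text.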

There is also a “dual” notion to DNF formulas: 


Definition 4.4. A CNF (conjunctive normal form) formula is a logical AND
of clauses, each of which is a logical OR of literals. Size and width are defined 
as for DNFs. 


Some functions can be represented much more compactly by CNFs than 
DNFs (see Exercise 4.14). On the other hand, if we take a CNF computing f and 
switch its ANDs and ORs, the result is a DNF computing the dual function $f^\dagger$
(see Exercises 1.8 and 4.2). Since $f$ and $f^\dagger$ have essentially the same Fourier
expansion, there isn’t much difference between CNFs and DNFs when it comes 
to Fourier analysis. We will therefore focus mainly on DNFs. 

DNFs and CNFs are more powerful than decision trees for representing 
Boolean-valued functions, as the following proposition shows: 


Proposition 4.5. Let $f : \{0,1\}^n \to \{0,1\}$ be computable by a decision tree $T$
of size $s$ and depth $k$. Then $f$ is computable by a DNF (and also a CNF) of size
at most $s$ and width at most $k$.


Proof. Take each path in T from the root to a leaf labeled 1 and form the 
logical AND of the literals describing the path. These are the terms of the 
required DNF. (For the CNF clauses, take paths to label 0 and negate all literals 
describing the path.) 


Example 4.6. If we perform this conversion on the decision tree computing
$\mathrm{Sort}_3$ in Figure 3.1 we get the DNF

$(\overline{x}_1 \wedge \overline{x}_3 \wedge \overline{x}_2) \vee (\overline{x}_1 \wedge x_3) \vee (x_1 \wedge \overline{x}_2 \wedge \overline{x}_3) \vee (x_1 \wedge x_2)$.

This has size 4 (indeed at most the decision tree size 6) and width 3 (indeed at
most the decision tree depth 3). It is not as simple as the equivalent DNF from
Example 4.2, though; DNF representation is not unique.
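The conversion in the proof of Proposition 4.5 is easy to mechanize. The Python sketch below is an illustration under assumptions: trees are encoded as nested tuples `(variable, subtree_if_0, subtree_if_1)` with integer leaves, variables are 0-indexed, and `SORT3_TREE` is a reconstruction of a size-6, depth-3 tree for $\mathrm{Sort}_3$ that is consistent with Example 4.6 (Figure 3.1 itself is not reproduced here).

```python
from itertools import product


def tree_to_dnf(tree, path=()):
    """Collect one term per root-to-leaf path ending at a leaf labeled 1.
    A literal (i, v) records that the path follows the edge x[i] == v."""
    if isinstance(tree, int):
        return [list(path)] if tree == 1 else []
    var, if_zero, if_one = tree
    return (tree_to_dnf(if_zero, path + ((var, 0),))
            + tree_to_dnf(if_one, path + ((var, 1),)))


def count_leaves(tree):
    """Decision tree size: the number of leaves."""
    if isinstance(tree, int):
        return 1
    return count_leaves(tree[1]) + count_leaves(tree[2])


def depth(tree):
    """Decision tree depth: the length of the longest root-to-leaf path."""
    if isinstance(tree, int):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


def eval_dnf(dnf, x):
    """Evaluate a DNF (OR of ANDs of literals) on a 0/1 tuple x."""
    return int(any(all(x[i] == v for i, v in term) for term in dnf))


def sort3(x):
    return int(x[0] <= x[1] <= x[2] or x[0] >= x[1] >= x[2])


# Assumed tree: query x1; on 0 query x3 then x2; on 1 query x2 then x3.
SORT3_TREE = (0, (2, (1, 1, 0), 1), (1, (2, 1, 0), 1))
SORT3_FROM_TREE = tree_to_dnf(SORT3_TREE)
```

The resulting `SORT3_FROM_TREE` has 4 terms of width at most 3, never more than the tree's size 6 and depth 3, and it agrees with $\mathrm{Sort}_3$ on all 8 inputs.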


The class of functions computable by small DNFs is intensively studied 
in learning theory. This is one reason why the problem of analyzing spectral 
concentration for DNFs is important. Let’s begin with the simplest method
for this: understanding low-degree concentration via total influence. We will
switch to $\pm 1$ notation.


Proposition 4.7. Suppose that $f : \{-1,1\}^n \to \{-1,1\}$ has $\mathrm{DNF}_{\mathrm{width}}(f) \le w$.
Then $\mathbf{I}[f] \le 2w$.

Proof. We use Exercise 2.10, which states that

$\mathbf{I}[f] = 2 \mathbf{E}_{x \sim \{-1,1\}^n}[\#\ (-1)\text{-pivotal coordinates for } f \text{ on } x]$,

where coordinate $i$ is “$(-1)$-pivotal” on input $x$ if $f(x) = -1$ (logical True) but
$f(x^{\oplus i}) = 1$ (logical False). It thus suffices to show that on every input $x$ there
are at most $w$ coordinates which are $(-1)$-pivotal. To have any $(-1)$-pivotal
coordinates at all on $x$ we must have $f(x) = -1$ (True); this means that at least
one term $T$ in $f$’s width-$w$ DNF representation must be made True by $x$. But
now if $i$ is a $(-1)$-pivotal coordinate then either $x_i$ or $\overline{x}_i$ must appear in $T$;
otherwise, $T$ would still be made true by $x^{\oplus i}$. Thus the number of $(-1)$-pivotal
coordinates on $x$ is at most the number of literals in $T$, which is at most $w$.
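The counting argument can be checked numerically on a small example. The Python sketch below is only an illustration: it works in 0/1 notation (1 playing the role of True, i.e. of $-1$ in the text), uses a hypothetical `(index, value)` encoding of literals, and takes as its example the width-2 DNF $(x_1 \wedge x_2) \vee (x_3 \wedge x_4)$.

```python
from itertools import product


def eval_dnf(dnf, x):
    """Evaluate a DNF on a 0/1 tuple x; literal (i, v) means x[i] == v."""
    return any(all(x[i] == v for i, v in term) for term in dnf)


def flip(x, i):
    """Return x with coordinate i flipped."""
    return x[:i] + (1 - x[i],) + x[i + 1:]


def total_influence(dnf, n):
    """Sum over i of Pr_x[f(x) != f(x with bit i flipped)]."""
    count = sum(eval_dnf(dnf, x) != eval_dnf(dnf, flip(x, i))
                for x in product((0, 1), repeat=n) for i in range(n))
    return count / 2 ** n


def expected_true_pivotal(dnf, n):
    """E_x[number of i with f(x) True but f(x with bit i flipped) False]."""
    count = sum(eval_dnf(dnf, x) and not eval_dnf(dnf, flip(x, i))
                for x in product((0, 1), repeat=n) for i in range(n))
    return count / 2 ** n


# Example (an assumption for this demo): two disjoint terms of width 2.
EXAMPLE_DNF = [[(0, 1), (1, 1)], [(2, 1), (3, 1)]]
INFLUENCE = total_influence(EXAMPLE_DNF, 4)
```

For this example each coordinate has influence $3/8$, so `INFLUENCE` is 1.5; it equals twice `expected_true_pivotal(EXAMPLE_DNF, 4)` and sits below the bound $2w = 4$.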


Since $\mathbf{I}[f^\dagger] = \mathbf{I}[f]$ the proposition is also true for CNFs of width at
most $w$. The proposition is very close to being tight: The parity function
$\chi_{[w]} : \{-1,1\}^n \to \{-1,1\}$ has $\mathbf{I}[\chi_{[w]}] = w$ and $\mathrm{DNF}_{\mathrm{width}}(\chi_{[w]}) \le w$ (the latter
being true for all $w$-juntas). In fact, the proposition can be improved to give the
tight upper bound $w$ (Exercise 4.17).

Using Proposition 3.2 we deduce: 


Corollary 4.8. Let $f : \{-1,1\}^n \to \{-1,1\}$ have $\mathrm{DNF}_{\mathrm{width}}(f) \le w$. Then for
$\epsilon > 0$, the Fourier spectrum of $f$ is $\epsilon$-concentrated on degree up to $2w/\epsilon$.


The dependence here on $w$ is of the correct order (by the example of the
parity $\chi_{[w]}$ again), but the dependence on $\epsilon$ can be significantly improved as
we will see in Section 4.4.

There’s usually more interest in DNF size than in DNF width; for example, 
learning theorists are often interested in the class of n-variable DNFs of size 
$\mathrm{poly}(n)$. The following fact (similar to Exercise 3.22) helps relate the two,
suggesting $O(\log s)$ as an analogous width bound:


Proposition 4.9. Let f : {—1, 1}" —> {—1, 1} be computable by a DNF (or 
CNF) of size s and let € € (0, 1]. Then f is €-close to a function g computable 
by a DNF of width log(s /€). 


Proof. Take the DNF computing f and delete all terms with more than $\log(s/\epsilon)$
literals; let g be the function computed by the resulting DNF. For any deleted
term T, the probability that a random input $x \sim \{-1,1\}^n$ makes T true is at most
$2^{-\log(s/\epsilon)} = \epsilon/s$. Taking a union bound over the (at most s) such terms shows
that $\Pr[g(x) \ne f(x)] \le \epsilon$. (A similar proof works for CNFs.)
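The truncation argument can be checked by brute force on a small example. The following is only a sketch under stated assumptions: the representation of a DNF as a list of terms (each a dict from variable index to required bit, with 1 standing for True rather than the $-1$ used in the text) and the helper names `eval_dnf`, `truncate`, and `distance` are illustrative choices, not notation from this chapter.

```python
from itertools import product
from math import log2

def eval_dnf(terms, x):
    """Evaluate a DNF on a 0/1 tuple x; each term maps index -> required bit."""
    return any(all(x[i] == b for i, b in t.items()) for t in terms)

def truncate(terms, s, eps):
    """Delete every term with more than log2(s/eps) literals."""
    cutoff = log2(s / eps)
    return [t for t in terms if len(t) <= cutoff]

def distance(terms_f, terms_g, n):
    """Fraction of inputs in {0,1}^n on which the two DNFs disagree."""
    pts = list(product((0, 1), repeat=n))
    return sum(eval_dnf(terms_f, x) != eval_dnf(terms_g, x) for x in pts) / len(pts)

# A hypothetical size-3 DNF on 6 variables with one wide term.
n = 6
f_terms = [{0: 1, 1: 1}, {2: 0, 3: 1}, {0: 1, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}]
s, eps = len(f_terms), 0.25
g_terms = truncate(f_terms, s, eps)   # the width-6 term exceeds log2(12)
dist = distance(f_terms, g_terms, n)  # should be at most eps
```

Here only the single wide term is deleted, and the two functions differ exactly on the inputs where that term alone is satisfied.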


By combining Proposition 4.9 and Corollary 4.8 we can deduce (using
Exercise 3.17) that DNFs of size s have Fourier spectra $\epsilon$-concentrated up
to degree $O(\log(s/\epsilon)/\epsilon)$. Again, the dependence on $\epsilon$ will be improved in
Section 4.4. We will also show in Section 4.3 that size-s DNFs have total
influence at most $O(\log s)$, something we cannot deduce immediately from
Proposition 4.7.

In light of the Kushilevitz–Mansour learning algorithm it would also be
nice to show that poly(n)-size DNFs have their Fourier spectra concentrated
on small collections (not necessarily low-degree). In Section 4.4 we will show
they are $\epsilon$-concentrated on collections of size $n^{O(\log\log n)}$ for any constant $\epsilon > 0$.
It has been conjectured that this can be improved to poly(n):


Mansour’s Conjecture. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a DNF
of size $s > 1$ and let $\epsilon \in (0, 1/2]$. Strong conjecture: f’s Fourier spectrum is
$\epsilon$-concentrated on a collection $\mathcal{F}$ with $|\mathcal{F}| \le s^{O(\log(1/\epsilon))}$. Weaker conjecture: if
$s \le \mathrm{poly}(n)$ and $\epsilon > 0$ is any fixed constant, then we have the bound
$|\mathcal{F}| \le \mathrm{poly}(n)$.


4.2. Tribes 


In this section we study the tribes DNF formulas, which serve as important
examples and counterexamples in analysis of Boolean functions. Perhaps the
most notable feature of the tribes function is that (for a suitable choice of 
parameters) it is essentially unbiased and yet all of its influences are quite 
tiny. 

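Recall from Chapter 2.1 that the function $\mathrm{Tribes}_{w,s} : \{-1,1\}^{sw} \to \{-1,1\}$
is defined by its width-w, size-s DNF representation:


$\mathrm{Tribes}_{w,s}(x_1, \ldots, x_w, \ldots, x_{(s-1)w+1}, \ldots, x_{sw})$
    $= (x_1 \wedge \cdots \wedge x_w) \vee \cdots \vee (x_{(s-1)w+1} \wedge \cdots \wedge x_{sw})$.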


(We are using the notation where $-1$ represents logical True and 1 represents
logical False.) As is computed in Exercise 2.13 we have:


Fact 4.10. $\Pr_x[\mathrm{Tribes}_{w,s}(x) = -1] = 1 - (1 - 2^{-w})^s$.


The most interesting setting of parameters makes this probability as close
to 1/2 as possible (a slightly different choice than the one in Exercise 2.13):


Definition 4.11. For $w \in \mathbb{N}^+$, let $s = s_w$ be the largest integer such that
$1 - (1 - 2^{-w})^s \le 1/2$. Then for $n = n_w = sw$ we define $\mathrm{Tribes}_n : \{-1,1\}^n \to
\{-1,1\}$ to be $\mathrm{Tribes}_{w,s}$. Note this is defined only for certain n: 1, 4, 15,
40, ...
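The admissible input lengths can be recomputed directly from the definition. This is a small sketch, assuming a hypothetical helper `tribes_params` (not a name from the text) and using exact rational arithmetic so that the comparison with 1/2 is not affected by rounding.

```python
from fractions import Fraction

def tribes_params(w):
    """Return (s, n): s is the largest integer with 1 - (1 - 2^-w)^s <= 1/2, and n = s*w."""
    q = 1 - Fraction(1, 2 ** w)  # probability that one tribe is not unanimously True
    s = 0
    while 1 - q ** (s + 1) <= Fraction(1, 2):
        s += 1
    return s, s * w

# Input lengths n_w for the first few widths.
sizes = [tribes_params(w)[1] for w in range(1, 6)]
```

The first four entries of `sizes` reproduce the values 1, 4, 15, 40 listed in the definition.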


Here $s \approx \ln(2) 2^w$, hence $n \approx \ln(2) w 2^w$ and therefore $w \approx \log n - \log \ln n$
and $s \approx n/\log n$. A slightly more careful accounting (Exercise 4.5) yields:


Proposition 4.12. For the $\mathrm{Tribes}_n$ function as in Definition 4.11:


• $s = \ln(2) 2^w - \Theta(1)$;
• $n = \ln(2) w 2^w - \Theta(w)$, thus $n_{w+1} = (2 + o(1)) n_w$;
• $w = \log n - \log \ln n + o_n(1)$, and $2^w = \frac{n}{\ln n}(1 + o_n(1))$;
• $\Pr[\mathrm{Tribes}_n(x) = -1] = 1/2 - O\left(\frac{\log n}{n}\right)$.


Thus with this setting of parameters $\mathrm{Tribes}_n$ is essentially unbiased. Regarding
its influences:


Proposition 4.13. $\mathbf{Inf}_i[\mathrm{Tribes}_n] = \frac{\ln n}{n}(1 \pm o(1))$ for each $i \in [n]$ and hence
$\mathbf{I}[\mathrm{Tribes}_n] = (\ln n)(1 \pm o(1))$.


Proof. Thinking of $\mathrm{Tribes}_n = \mathrm{Tribes}_{w,s}$ as a voting rule, voter i is pivotal if
and only if: (a) all other voters in i’s “tribe” vote $-1$ (True); (b) all other tribes
produce the outcome 1 (False). The probability of this is indeed


$2^{-(w-1)} \cdot (1 - 2^{-w})^{s-1} = \frac{2}{2^w - 1} \cdot \Pr[\mathrm{Tribes}_n = 1] = \frac{\ln n}{n}(1 \pm o(1))$,


where we used Fact 4.10 and then Proposition 4.12.
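The exact pivotality probability $2^{-(w-1)}(1 - 2^{-w})^{s-1}$ can be compared against a brute-force count for a small instance. The sketch below assumes 0/1 inputs with 1 standing for True (instead of the $-1$ convention of the text), and the helper names `tribes` and `influence` are illustrative.

```python
from itertools import product

def tribes(x, w):
    """Tribes on a 0/1 tuple (1 = True): OR of ANDs over consecutive blocks of width w."""
    return any(all(x[j:j + w]) for j in range(0, len(x), w))

def influence(f, n, i):
    """Probability over uniform x in {0,1}^n that flipping coordinate i changes f(x)."""
    count = 0
    for x in product((0, 1), repeat=n):
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        count += f(x) != f(y)
    return count / 2 ** n

w, s = 3, 5          # the parameters of Tribes_15 from Definition 4.11
n = w * s
brute = influence(lambda x: tribes(x, w), n, 0)
formula = 2.0 ** (-(w - 1)) * (1 - 2.0 ** (-w)) ** (s - 1)
```

For these parameters both quantities equal 4802/32768, roughly 0.147; by symmetry the same value holds for every coordinate.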


Thus if we are interested in (essentially) unbiased voting rules in which
every voter has small influence, $\mathrm{Tribes}_n$ is a much stronger example than $\mathrm{Maj}_n$,
where each voter has influence $\Theta(1/\sqrt{n})$. You may wonder if the maximum
influence can be even smaller than $\Theta\left(\frac{\ln n}{n}\right)$ for unbiased voting rules. Certainly
it can’t be smaller than $\frac{1}{n}$, since the Poincaré Inequality says that $\mathbf{I}[f] \ge 1$ for
unbiased f. In fact the famous KKL Theorem shows that the $\mathrm{Tribes}_n$ example
is tight up to constants:


Kahn-Kalai-Linial (KKL) Theorem. For any $f : \{-1,1\}^n \to \{-1,1\}$,


$\mathbf{MaxInf}[f] = \max_{i \in [n]} \{\mathbf{Inf}_i[f]\} \ge \mathbf{Var}[f] \cdot \Omega\left(\frac{\log n}{n}\right)$.


We prove the KKL Theorem in Chapter 9.

We conclude this section by recording a formula for the Fourier coefficients
of $\mathrm{Tribes}_{w,s}$. The proof is Exercise 4.6.


Proposition 4.14. Suppose we index the Fourier coefficients of
$\mathrm{Tribes}_{w,s} : \{-1,1\}^{sw} \to \{-1,1\}$ by sets $T = (T_1, \ldots, T_s) \subseteq [sw]$, where $T_i$ is
the intersection of T with the ith “tribe”. Then


$\widehat{\mathrm{Tribes}_{w,s}}(T) = \begin{cases} 2(1 - 2^{-w})^s - 1 & \text{if } T = \emptyset, \\ 2(-1)^{k+|T|} 2^{-kw} (1 - 2^{-w})^{s-k} & \text{if } k = \#\{i : T_i \ne \emptyset\} > 0. \end{cases}$
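As a sanity check (not a proof), the closed form can be compared with Fourier coefficients computed directly from the definition $\widehat{f}(T) = \mathbf{E}[f(x) \prod_{i \in T} x_i]$ for a tiny instance. The function names below are assumptions made for this sketch, and variables are indexed from 0, so coordinate i lies in tribe `i // w`.

```python
from itertools import combinations, product
from math import prod

def tribes_pm(x, w):
    """Tribes in +-1 notation: -1 (True) iff some block of width w is all -1."""
    blocks = (x[j:j + w] for j in range(0, len(x), w))
    return -1 if any(all(v == -1 for v in b) for b in blocks) else 1

def fourier_coeff(f, n, T):
    """E[f(x) * prod_{i in T} x_i] over uniform x in {-1,1}^n."""
    total = sum(f(x) * prod(x[i] for i in T) for x in product((-1, 1), repeat=n))
    return total / 2 ** n

def formula(T, w, s):
    """Closed form from Proposition 4.14."""
    k = len({i // w for i in T})  # number of tribes that T touches
    if k == 0:
        return 2 * (1 - 2.0 ** -w) ** s - 1
    return 2 * (-1) ** (k + len(T)) * 2.0 ** (-k * w) * (1 - 2.0 ** -w) ** (s - k)

w, s = 2, 2
n = w * s
f = lambda x: tribes_pm(x, w)
max_err = max(
    abs(fourier_coeff(f, n, T) - formula(T, w, s))
    for r in range(n + 1)
    for T in combinations(range(n), r)
)
```

For $w = s = 2$ the empty-set coefficient is $2(3/4)^2 - 1 = 1/8$, and `max_err` measures the largest discrepancy over all 16 sets.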


4.3. Random Restrictions 


In this section we describe the method of applying random restrictions. This is a 
very “Fourier-friendly” way of simplifying a Boolean function. As motivation, 
let’s consider the problem of bounding total influence for size-s DNFs. One 
plan is to use the results from Section 4.1: size-s DNFs are .01-close to width- 
O(log s) DNFs, which in turn have total influence O(log s). This suggests that 
size-s DNFs themselves have total influence O(log s). To prove this though 
we'll need to reverse the steps of the plan; instead of truncating DNFs to a 
fixed width and arguing that a random input is unlikely to notice, we’ll first 
pick a random (partial) input and argue that this is likely to make the width 
small. 
Let’s formalize the notion of a random partial input, or restriction: 


Definition 4.15. For $\delta \in [0, 1]$, we say that J is a $\delta$-random subset of N if it
is formed by including each element of N independently with probability $\delta$.
We define a $\delta$-random restriction on $\{-1,1\}^n$ to be a pair $(J \mid z)$, where first
J is chosen to be a $\delta$-random subset of [n] and then $z \sim \{-1,1\}^{\overline{J}}$ is chosen
uniformly at random. We say that coordinate $i \in [n]$ is free if $i \in J$ and
is fixed if $i \notin J$. An equivalent definition is that each coordinate i is
(independently) free with probability $\delta$ and fixed to $\pm 1$ with probability $(1 - \delta)/2$
each.


Given $f : \{-1,1\}^n \to \mathbb{R}$ and a random restriction $(J \mid z)$, we can form the
restricted function $f_{J \mid z} : \{-1,1\}^J \to \mathbb{R}$ as usual. However, it’s inconvenient
that the domain of this function depends on the random restriction. Thus when
dealing with random restrictions we usually invoke the following convention:


Definition 4.16. Given $f : \{-1,1\}^n \to \mathbb{R}$, $I \subseteq [n]$, and $z \in \{-1,1\}^{\overline{I}}$, we
may identify the restricted function $f_{I \mid z} : \{-1,1\}^I \to \mathbb{R}$ with its extension
$f_{I \mid z} : \{-1,1\}^n \to \mathbb{R}$ in which the input coordinates $\{-1,1\}^{\overline{I}}$ are ignored.
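The two definitions above translate into a short sampling routine. This is a minimal sketch under assumptions: coordinates are indexed from 0, a restriction is stored as a set of free coordinates plus a dict of fixed values, and the names `random_restriction` and `restrict` are hypothetical.

```python
import random

def random_restriction(n, delta, rng=random):
    """Sample (J | z): J is a delta-random subset of range(n); z assigns +-1 to the rest."""
    J = {i for i in range(n) if rng.random() < delta}
    z = {i: rng.choice((-1, 1)) for i in range(n) if i not in J}
    return J, z

def restrict(f, n, J, z):
    """Return f_{J|z} as a function on all of {-1,1}^n that ignores coordinates outside J."""
    def g(x):
        y = tuple(x[i] if i in J else z[i] for i in range(n))
        return f(y)
    return g

# Example: restrict f(x) = x_0 * x_1 by fixing coordinate 1 to -1.
f = lambda x: x[0] * x[1]
g = restrict(f, 2, {0}, {1: -1})
```

The restricted function `g` takes full-length inputs, following the convention of Definition 4.16: its value depends only on the free coordinate 0.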


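As mentioned, random restrictions interact nicely with Fourier expansions:


Proposition 4.17. Fix $f : \{-1,1\}^n \to \mathbb{R}$ and $S \subseteq [n]$. Then if $(J \mid z)$ is a
$\delta$-random restriction on $\{-1,1\}^n$,


$\mathbf{E}[\widehat{f_{J \mid z}}(S)] = \Pr[S \subseteq J] \cdot \widehat{f}(S) = \delta^{|S|} \widehat{f}(S)$,


and


$\mathbf{E}[\widehat{f_{J \mid z}}(S)^2] = \sum_{U \subseteq [n]} \Pr[U \cap J = S] \cdot \widehat{f}(U)^2 = \sum_{U \supseteq S} \delta^{|S|} (1 - \delta)^{|U \setminus S|} \widehat{f}(U)^2$,


where we are treating $f_{J \mid z}$ as a function $\{-1,1\}^n \to \mathbb{R}$.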


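Proof. Suppose first that $J \subseteq [n]$ is fixed. When we think of restricted functions
$f_{J \mid z}$ as having domain $\{-1,1\}^n$, Corollary 3.22 may be stated as saying that
for any $S \subseteq [n]$,


$\mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}[\widehat{f_{J \mid z}}(S)] = \widehat{f}(S) \cdot 1_{S \subseteq J}$,


$\mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}[\widehat{f_{J \mid z}}(S)^2] = \sum_{U \subseteq [n]} \widehat{f}(U)^2 \cdot 1_{U \cap J = S}$.


The proposition now follows by taking the expectation over J.


Corollary 4.18. Fix $f : \{-1,1\}^n \to \mathbb{R}$ and $i \in [n]$. If $(J \mid z)$ is a $\delta$-random
restriction, then $\mathbf{E}[\mathbf{Inf}_i[f_{J \mid z}]] = \delta \, \mathbf{Inf}_i[f]$. Hence also $\mathbf{E}[\mathbf{I}[f_{J \mid z}]] = \delta \, \mathbf{I}[f]$.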


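Proof. We have


$\mathbf{E}[\mathbf{Inf}_i[f_{J \mid z}]] = \mathbf{E}\left[\sum_{S \ni i} \widehat{f_{J \mid z}}(S)^2\right] = \sum_{S \ni i} \sum_{U \subseteq [n]} \Pr[U \cap J = S] \, \widehat{f}(U)^2$


$= \sum_{U \subseteq [n]} \Pr[U \cap J \ni i] \, \widehat{f}(U)^2 = \sum_{U \ni i} \delta \, \widehat{f}(U)^2 = \delta \, \mathbf{Inf}_i[f]$,


where the second equality used Proposition 4.17.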


(Proving Corollary 4.18 via Proposition 4.17 is a bit more elaborate than 
necessary; see Exercise 4.9.) 
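Corollary 4.18 can also be confirmed numerically on a small Boolean-valued function by enumerating every restriction with its exact probability. The sketch below uses the equivalent per-coordinate description from Definition 4.15 (free with probability $\delta$, fixed to each sign with probability $(1-\delta)/2$); the helper names and the choice of the 3-bit majority function are assumptions made for illustration.

```python
from itertools import product

def influence(f, n, i):
    """Pr over uniform x in {-1,1}^n that negating coordinate i changes f(x)."""
    pts = list(product((-1, 1), repeat=n))
    return sum(f(x) != f(x[:i] + (-x[i],) + x[i + 1:]) for x in pts) / len(pts)

def expected_restricted_influence(f, n, i, delta):
    """Exact E[Inf_i[f_{J|z}]]: each coordinate is 'free', fixed to -1, or fixed to +1."""
    total = 0.0
    for pattern in product(("free", -1, 1), repeat=n):
        prob = 1.0
        for c in pattern:
            prob *= delta if c == "free" else (1 - delta) / 2
        # Restricted function, viewed on {-1,1}^n with fixed coordinates ignored.
        g = lambda x, p=pattern: f(tuple(xi if c == "free" else c for xi, c in zip(x, p)))
        total += prob * influence(g, n, i)
    return total

maj3 = lambda x: 1 if sum(x) > 0 else -1
delta = 0.3
lhs = expected_restricted_influence(maj3, 3, 0, delta)
rhs = delta * influence(maj3, 3, 0)
```

Here `lhs` and `rhs` agree up to floating-point rounding, matching $\mathbf{E}[\mathbf{Inf}_i[f_{J \mid z}]] = \delta \, \mathbf{Inf}_i[f]$.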

Corollary 4.18 lets us bound the total influence of a function f by bounding 
the (expected) total influence of a random restriction of f. This is useful if f 
is computable by a DNF formula of small size, since a random restriction is 
very likely to make this DNF have small width. This is a consequence of the 
following lemma:


Lemma 4.19. Let T be a DNF term over $\{-1,1\}^n$ and fix $w \in \mathbb{N}^+$. Let $(J \mid z)$ be
a (1/2)-random restriction on $\{-1,1\}^n$. Then $\Pr[\mathrm{width}(T_{J \mid z}) \ge w] \le (3/4)^w$.


Proof. We may assume the initial width of T is at least w, as otherwise its
restriction under $(J \mid z)$ cannot have width at least w. Now if any literal appearing
in T is fixed to False by the random restriction, the restricted term $T_{J \mid z}$ will
be constantly False and thus have width $0 < w$. Each literal is fixed to False
with probability 1/4; hence the probability no literal in T is fixed to False is at
most $(3/4)^w$.
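For a single term the probability in Lemma 4.19 can be written down exactly, which makes the bound easy to check. This sketch assumes the term has W distinct literals and uses an illustrative helper name; it computes the exact probability rather than the upper bound used in the proof.

```python
from math import comb

def prob_width_at_least(W, w):
    """Exact Pr[width(T_{J|z}) >= w] for a term with W literals, for w >= 1.

    Under a (1/2)-random restriction each literal is independently free (prob 1/2),
    fixed to True (prob 1/4), or fixed to False (prob 1/4). The restricted term has
    width j >= 1 exactly when no literal is fixed to False and exactly j are free.
    """
    return sum(comb(W, j) * 0.5 ** j * 0.25 ** (W - j) for j in range(w, W + 1))

# Compare with the (3/4)^w bound over a range of term widths.
bound_holds = all(
    prob_width_at_least(W, w) <= 0.75 ** w
    for W in range(1, 31)
    for w in range(1, W + 1)
)
```

The exact value is usually well below $(3/4)^w$, since the proof only uses the event that no literal is fixed to False.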


We can now bound the total influence of small DNF formulas. 


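Theorem 4.20. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a DNF of size s.
Then $\mathbf{I}[f] \le O(\log s)$.


Proof. Let $(J \mid z)$ be a (1/2)-random restriction on $\{-1,1\}^n$. Let
$\boldsymbol{w} = \mathrm{DNF}_{\mathrm{width}}(f_{J \mid z})$. By a union bound and Lemma 4.19 we have that
$\Pr[\boldsymbol{w} \ge w] \le s(3/4)^w$. Hence


$\mathbf{E}[\boldsymbol{w}] = \sum_{w=1}^{\infty} \Pr[\boldsymbol{w} \ge w] \le 3\log s + \sum_{w > 3\log s} s(3/4)^w$


$\le 3\log s + 4s(3/4)^{3\log s} \le 3\log s + 4/s^{0.2} = O(\log s)$.


From Proposition 4.7 we obtain $\mathbf{E}[\mathbf{I}[f_{J \mid z}]] \le 2 \cdot O(\log s) = O(\log s)$. And so
from Corollary 4.18 we conclude $\mathbf{I}[f] = 2\,\mathbf{E}[\mathbf{I}[f_{J \mid z}]] \le O(\log s)$.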


4.4. Håstad’s Switching Lemma and the Spectrum of DNFs


Let’s further investigate how random restrictions can simplify DNF formulas.
Suppose f is computable by a DNF formula of width w, and we apply to it a
$\delta$-random restriction with $\delta \ll 1/w$. For each term T in the DNF, one of three
things may happen to it under the random restriction. First and by far most
likely, one of its literals may be fixed to False, allowing us to delete it. If this
doesn’t happen, the second possibility is that all of T’s literals are made True,
in which case the whole DNF reduces to the constantly True function. With
$\delta \ll 1/w$, this is in turn much more likely than the third possibility, which is
that at least one of T’s literals is left free, but all the fixed literals are made
True. Only in this third case is T not trivialized by the random restriction.

This reasoning might suggest that f is likely to become a constant function
under the random restriction. Indeed, this is true, as the following theorem
shows:


Baby Switching Lemma. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a
DNF or CNF of width at most w and let $(J \mid z)$ be a $\delta$-random restriction.
Then


$\Pr[f_{J \mid z} \text{ is not a constant function}] \le 5\delta w$.


This is in fact the $k = 1$ case of the following much more powerful theorem:


Håstad’s Switching Lemma. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by
a DNF or CNF of width at most w and let $(J \mid z)$ be a $\delta$-random restriction.
Then for any $k \in \mathbb{N}$,


$\Pr[\mathrm{DT}(f_{J \mid z}) \ge k] \le (5\delta w)^k$.


What is remarkable about this result is that it has no dependence on the
size of the DNF, or on n. In words, Håstad’s Switching Lemma says that when
$\delta \ll 1/w$, it’s exponentially unlikely (in k) that applying a $\delta$-random restriction
to a width-w DNF does not convert (“switch”) it to a decision tree of depth
less than k. The result is called a “lemma” for historical reasons; in fact, its
proof requires some work. You are asked to prove the Baby Switching Lemma
in Exercise 4.19; for Håstad’s Switching Lemma, consult Håstad’s original
proof (Håstad, 1987) or the alternate proof of Razborov (Razborov, 1993;
Beame, 1994).
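For a very small DNF the probability in the Baby Switching Lemma can be computed exactly by enumerating all $3^n$ restrictions. The following sketch makes several assumptions for illustration: inputs are 0/1 with 1 standing for True, free coordinates are marked `"*"`, and the example formula and helper names are hypothetical.

```python
from itertools import product

def is_constant_after(f, pattern):
    """Check whether f is constant once coordinates are fixed as in pattern ('*' = free)."""
    free = [i for i, c in enumerate(pattern) if c == "*"]
    values = set()
    for bits in product((0, 1), repeat=len(free)):
        x = list(pattern)
        for i, b in zip(free, bits):
            x[i] = b
        values.add(f(tuple(x)))
    return len(values) == 1

def prob_not_constant(f, n, delta):
    """Exact Pr[f_{J|z} is not constant] under a delta-random restriction."""
    total = 0.0
    for pattern in product(("*", 0, 1), repeat=n):
        prob = 1.0
        for c in pattern:
            prob *= delta if c == "*" else (1 - delta) / 2
        if not is_constant_after(f, pattern):
            total += prob
    return total

# A width-2 DNF on 4 variables: (x0 AND x1) OR (x2 AND x3).
dnf = lambda x: bool((x[0] and x[1]) or (x[2] and x[3]))
w = 2
probs = {delta: prob_not_constant(dnf, 4, delta) for delta in (0.01, 0.05, 0.1, 0.2)}
```

Each entry of `probs` can then be compared with the bound $5\delta w$; for this particular formula the exact probability is below $2\delta$, so the lemma is far from tight here.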

Since we have strong results about the Fourier spectra of decision trees
(Proposition 3.16), and since we know random restrictions interact nicely with
Fourier coefficients (Proposition 4.17), Håstad’s Switching Lemma allows us to
prove some strong results about Fourier concentration of narrow DNF formulas.
We start with an intermediate result which will be of use:


Lemma 4.21. Let $f : \{-1,1\}^n \to \{-1,1\}$ and let $(J \mid z)$ be a $\delta$-random
restriction, $\delta > 0$. Fix $k \in \mathbb{N}^+$ and write $\epsilon = \Pr[\mathrm{DT}(f_{J \mid z}) \ge k]$. Then the
Fourier spectrum of f is $3\epsilon$-concentrated on degree up to $3k/\delta$.


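Proof. The key observation is that $\mathrm{DT}(f_{J \mid z}) < k$ implies $\deg(f_{J \mid z}) < k$
(Proposition 3.16), in which case the Fourier weight of $f_{J \mid z}$ at degree k and above
is 0. Since this weight is at most 1 in all cases we conclude


$\mathbf{E}_{(J \mid z)}\left[\sum_{S \subseteq [n],\, |S| \ge k} \widehat{f_{J \mid z}}(S)^2\right] \le \epsilon$.


Using Proposition 4.17 we have


$\mathbf{E}_{(J \mid z)}\left[\sum_{|S| \ge k} \widehat{f_{J \mid z}}(S)^2\right] = \sum_{|S| \ge k} \mathbf{E}_{(J \mid z)}[\widehat{f_{J \mid z}}(S)^2] = \sum_{U \subseteq [n]} \Pr_{(J \mid z)}[|U \cap J| \ge k] \cdot \widehat{f}(U)^2$.


The distribution of the random variable $|U \cap J|$ is Binomial$(|U|, \delta)$. When
$|U| \ge 3k/\delta$ this random variable has mean at least 3k, and a Chernoff bound shows
$\Pr[|U \cap J| < k] \le \exp(-\frac{2}{3}k) \le 2/3$. Thus


$\epsilon \ge \sum_{U \subseteq [n]} \Pr[|U \cap J| \ge k] \cdot \widehat{f}(U)^2 \ge \sum_{|U| \ge 3k/\delta} (1 - 2/3) \cdot \widehat{f}(U)^2$


and hence $\sum_{|U| \ge 3k/\delta} \widehat{f}(U)^2 \le 3\epsilon$ as claimed.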
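The Chernoff step can be checked against exact binomial tails. In the sketch below the restriction parameter is taken to be $\delta = 1/r$ for an integer r, so that $|U| = 3kr$ is exactly $3k/\delta$; this parametrization and the helper name are assumptions for illustration. The comparison uses the multiplicative Chernoff bound $\Pr[X \le (1 - \eta)\mu] \le \exp(-\eta^2 \mu / 2)$ with $\eta = 2/3$ and $\mu = 3k$.

```python
from math import comb, exp

def binom_cdf_below(m, p, k):
    """Exact Pr[Binomial(m, p) < k]."""
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j) for j in range(k))

# Tail probabilities at the threshold |U| = 3k/delta, for delta = 1/r.
tails = {
    (r, k): binom_cdf_below(3 * k * r, 1 / r, k)
    for r in (2, 5, 10)
    for k in range(1, 7)
}
chernoff_ok = all(t <= exp(-2 * k / 3) <= 2 / 3 for (r, k), t in tails.items())
```

Larger sets U only make the tail smaller, so checking the threshold size is the relevant case.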


We can now improve the dependence on $\epsilon$ in Corollary 4.8’s low-degree
spectral concentration for DNFs:


Theorem 4.22. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is computable by a DNF
of width w. Then f’s Fourier spectrum is $\epsilon$-concentrated on degree up
to $O(w \log(1/\epsilon))$.


Proof. This follows immediately from Håstad’s Switching Lemma and
Lemma 4.21, taking $\delta = \frac{1}{10w}$ and $k = C \log(1/\epsilon)$ for a sufficiently large
constant C.


In Lemma 4.21, instead of using the fact that depth-k decision trees have no
Fourier weight above degree k, we could have used the fact that their Fourier
1-norm is at most $2^k$. As you are asked to show in Exercise 4.11, this would
yield:


Lemma 4.23. Let $f : \{-1,1\}^n \to \{-1,1\}$ and let $(J \mid z)$ be a $\delta$-random
restriction. Then


$\sum_{U \subseteq [n]} \delta^{|U|} \cdot |\widehat{f}(U)| \le \mathbf{E}_{(J \mid z)}\left[2^{\mathrm{DT}(f_{J \mid z})}\right]$.


We can combine this with the Switching Lemma to deduce that width-w
DNFs have small Fourier 1-norm at low degree:


Theorem 4.24. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is computable by a DNF of
width w. Then for any k,


$\sum_{|U| \le k} |\widehat{f}(U)| \le 2 \cdot (20w)^k$.


Proof. Apply Håstad’s Switching Lemma to f with $\delta = \frac{1}{20w}$ to deduce


$\mathbf{E}_{(J \mid z)}\left[2^{\mathrm{DT}(f_{J \mid z})}\right] \le \sum_{d=0}^{\infty} \left(\frac{5}{20}\right)^d \cdot 2^d = 2$.


Thus from Lemma 4.23 we get


$2 \ge \sum_{U \subseteq [n]} \left(\frac{1}{20w}\right)^{|U|} \cdot |\widehat{f}(U)| \ge \left(\frac{1}{20w}\right)^k \cdot \sum_{|U| \le k} |\widehat{f}(U)|$,


as needed.


Our two theorems about the Fourier structure of DNF are almost enough to
prove Mansour’s Conjecture:


Theorem 4.25. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a DNF of
width $w \ge 2$. Then for any $\epsilon \in (0, 1/2]$, the Fourier spectrum of f is
$\epsilon$-concentrated on a collection $\mathcal{F}$ with $|\mathcal{F}| \le w^{O(w \log(1/\epsilon))}$.


Proof. Let $k = Cw \log(4/\epsilon)$ and let $g = f^{\le k}$. If C is a large enough constant,
then Theorem 4.22 tells us that $\|f - g\|_2^2 \le \epsilon/4$. Furthermore, Theorem 4.24
gives $\hat{\lVert} g \hat{\rVert}_1 \le w^{O(w \log(1/\epsilon))}$. By Exercise 3.16, g is $(\epsilon/4)$-concentrated on some
collection $\mathcal{F}$ with $|\mathcal{F}| \le 4 \hat{\lVert} g \hat{\rVert}_1^2 / \epsilon \le w^{O(w \log(1/\epsilon))}$. And so by Exercise 3.17,
f is $\epsilon$-concentrated on this same collection.


For the interesting case of DNFs of width $O(\log n)$ and constant $\epsilon$, we
get concentration on a collection of cardinality $O(\log n)^{O(\log n)} = n^{O(\log\log n)}$,
nearly polynomial. Using Proposition 4.9 (and Exercise 3.17) we get the same
deduction for DNFs of size poly(n); more generally, for size s we have
$\epsilon$-concentration on a collection of cardinality at most $(s/\epsilon)^{O(\log\log(s/\epsilon) \log(1/\epsilon))}$.


4.5. Highlight: LMN’s Work on Constant-Depth Circuits 


Having derived strong results about the Fourier spectrum of small DNFs and 
CNFs, we will now extend to the case of constant-depth circuits. We begin 
by describing how Håstad applied his Switching Lemma to constant-depth
circuits. We then describe some Fourier-theoretic consequences coming from 
a very early (1989) work in analysis of Boolean functions by Linial, Mansour, 
and Nisan (LMN). 

To define constant-depth circuits it is best to start with a picture. Figure 4.1 
shows an example of a depth-3 circuit.


Figure 4.1. Example of a depth-3 circuit, with the layer 0 nodes at the bottom
and the layer 3 node at the top.


This circuit computes the function 
XX2 A 1x3 V x3x4) A (x3x4 V X2), 
where we suppressed the $\wedge$ in concatenated literals. To be precise:


Definition 4.26. For an integer $d \ge 2$, we define a depth-d circuit over Boolean
variables $x_1, \ldots, x_n$ as follows: It is a directed acyclic graph in which the
nodes (“gates”) are arranged in $d + 1$ layers, with all arcs (“wires”) going
from layer $j - 1$ to layer j for some $j \in [d]$. There are exactly 2n nodes
in layer 0 (the “inputs”) and exactly 1 node in layer d (the “output”). The
nodes in layer 0 are labeled by the 2n literals. The nodes in layers 1, 3, 5,
etc. have the same label, either $\wedge$ or $\vee$, and the nodes in layers 2, 4, 6, etc.
have the other label. Each node “computes” a function $\{-1,1\}^n \to \{-1,1\}$:
the literals compute themselves and the $\wedge$ (respectively, $\vee$) nodes compute the
logical AND (respectively, OR) of the functions computed by their incoming
nodes. The circuit itself is said to compute the function computed by its output
node.


In particular, DNFs and CNFs are depth-2 circuits. We extend the definitions 
of size and width appropriately: 


Definition 4.27. The size of a depth-d circuit is defined to be the number of 
nodes in layers 1 through d — 1. Its width is the maximum in-degree of any 
node at layer 1. (As with DNFs and CNFs, we insist that no node at layer 1 is 
connected to a variable or its negation more than once.) 


The layering we assume in our definition of depth-d circuits can be achieved 
with a factor-2d size overhead for any “unbounded fan-in AND/OR/NOT 
circuit”. We will not discuss any other type of Boolean circuit in this 
section. 

We now show that Håstad’s Switching Lemma can be usefully applied not
just to DNFs and CNFs but more generally to constant-depth circuits:


Lemma 4.28. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a depth-d circuit
of size s and width w, and let $\epsilon \in (0, 1]$. Set


$\delta = \frac{1}{10w} \cdot \left(\frac{1}{10\ell}\right)^{d-2}$, where $\ell = \log(2s/\epsilon)$.


Then if $(J \mid z)$ is a $\delta$-random restriction, $\Pr[\mathrm{DT}(f_{J \mid z}) \ge \log(2/\epsilon)] \le \epsilon$.


Proof. The $d = 2$ case is immediate from Håstad’s Switching Lemma, so we
assume $d \ge 3$.

The first important observation is that random restrictions “compose”. That
is, making a $\delta_1$-random restriction followed by a $\delta_2$-random restriction to the
free coordinates is equivalent to making a $\delta_1\delta_2$-random restriction. Thus we
can think of $(J \mid z)$ as being produced as follows:


(1) make a $\frac{1}{10w}$-random restriction;
(2) make $d - 3$ subsequent $\frac{1}{10\ell}$-random restrictions;
(3) make a final $\frac{1}{10\ell}$-random restriction.


Without loss of generality, assume the nodes at layer 2 of the circuit are
labeled $\vee$. Thus any node g at layer 2 computes a DNF of width at most w.
By Håstad’s Switching Lemma, after the initial $\frac{1}{10w}$-random restriction g can
be replaced by a decision tree of depth at most $\ell$ except with probability at
most $2^{-\ell}$. In particular, it can be replaced by a CNF of width at most $\ell$, using
Proposition 4.5. If we write $s_2$ for the number of nodes at layer 2, a union bound
lets us conclude:


$\Pr_{\frac{1}{10w}\text{-random restriction}}[\text{not all nodes at layer 2 replaceable by width-}\ell\text{ CNFs}] \le s_2 \cdot 2^{-\ell}$.   (4.1)

We now come to the second important observation: If all nodes at layer 2
can be switched to width-$\ell$ CNFs, then layers 2 and 3 can be “compressed”,
producing a depth-$(d - 1)$ circuit of width at most $\ell$. More precisely, we can
form an equivalent circuit by shortening all length-2 paths from layer 1 to
layer 3 into single arcs, and then deleting the nodes at layer 2. We give an
illustration of this in Figure 4.2.

Assuming the event in (4.1) does not occur, the initial $\frac{1}{10w}$-random restriction
reduces the circuit to having depth $d - 1$ and width at most $\ell$. The number
of $\wedge$-nodes at the new layer 2 is at most $s_3$, the number of nodes at layer 3 in
the original circuit.

Next we make a $\frac{1}{10\ell}$-random restriction. As before, by Håstad’s Switching
Lemma this reduces all width-$\ell$ CNFs at the new layer 2 to depth-$\ell$ decision
trees (hence width-$\ell$ DNFs), except with probability at most $s_3 \cdot 2^{-\ell}$. We may
then compress layers and reduce depth again.


Figure 4.2. At top is the initial circuit. Under the restriction fixing $x_3$ = True,
all three DNFs at layer 2 may be replaced by CNFs of width at most 2. Finally,
the nodes at layers 2 and 3 may be compressed.


Proceeding for all $\frac{1}{10\ell}$-random restrictions except the final one, a union
bound gives


$\Pr_{\frac{1}{10w} \cdot \left(\frac{1}{10\ell}\right)^{d-3}\text{-random restriction}}[\text{circuit does not reduce to depth 2 and width } \ell]$
    $\le s_2 2^{-\ell} + s_3 2^{-\ell} + \cdots + s_{d-1} 2^{-\ell} \le s \cdot 2^{-\ell} = \epsilon/2$.


Assuming the event above does not occur, Håstad’s Switching Lemma tells us
that the final $\frac{1}{10\ell}$-random restriction reduces the circuit to a decision tree of
depth less than $\log(2/\epsilon)$ except with probability at most $\epsilon/2$. This completes
the proof.
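The composition property used at the start of the proof can be verified coordinate by coordinate: a coordinate stays free only if it is free in both stages, and otherwise ends up fixed to a uniformly random sign. The sketch below tracks the per-coordinate law exactly with rational arithmetic; the function names and the sample values of w, $\ell$, and d are assumptions chosen for illustration.

```python
from fractions import Fraction
from functools import reduce

def direct(d):
    """Per-coordinate law of a d-random restriction: (free, fixed to +1, fixed to -1)."""
    return d, (1 - d) / 2, (1 - d) / 2

def compose(d1, d2):
    """Law after a d1-random restriction followed by a d2-random one on the free coordinates."""
    free = d1 * d2
    fixed_plus = (1 - d1) / 2 + d1 * (1 - d2) / 2  # fixed in stage 1, or freed then fixed
    return free, fixed_plus, fixed_plus

# The schedule from the proof of Lemma 4.28 with hypothetical w = 3, ell = 5, d = 4.
w, ell, d = 3, 5, 4
stages = [Fraction(1, 10 * w)] + [Fraction(1, 10 * ell)] * (d - 2)
overall = reduce(lambda a, b: compose(a, b)[0], stages)
```

With these values `overall` equals $\frac{1}{10w} \cdot \left(\frac{1}{10\ell}\right)^{d-2} = 1/75000$, the $\delta$ in the statement of the lemma.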


We may now obtain the main theorem of Linial, Mansour, and Nisan: 




LMN Theorem. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a depth-d
circuit of size $s > 1$ and let $\epsilon \in (0, 1/2]$. Then f’s Fourier spectrum is
$\epsilon$-concentrated up to degree $O(\log(s/\epsilon))^{d-1} \cdot \log(1/\epsilon)$.


Proof. If the circuit for f also had width at most w, we could deduce
$3\epsilon$-concentration up to degree $30w \cdot (10 \log(2s/\epsilon))^{d-2} \cdot \log(2/\epsilon)$ by combining
Lemma 4.28 with Lemma 4.21. But if we simply delete all layer-1 nodes of
width at least $\log(s/\epsilon)$, the resulting circuit computes a function which is
$\epsilon$-close to f, as in the proof of Proposition 4.9. Thus (using Exercise 3.17) f’s
spectrum is $O(\epsilon)$-concentrated up to degree $O(\log(2s/\epsilon))^{d-1} \cdot \log(2/\epsilon)$, and
the result follows by adjusting constants.


Remark 4.29. Håstad (Håstad, 2001a) has slightly sharpened the degree in the
LMN Theorem to $O(\log(s/\epsilon))^{d-2} \cdot \log(s) \cdot \log(1/\epsilon)$.


In Exercise 4.20 you are asked to use a simpler version of this proof, along 
the lines of Theorem 4.20, to show the following: 


Theorem 4.30. Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a depth-d circuit
of size s. Then $\mathbf{I}[f] \le O(\log s)^{d-1}$.


These rather strong Fourier concentration results for constant-depth circuits
have several applications. By introducing the Low-Degree Algorithm for
learning, Linial, Mansour, and Nisan gave as their main application:


Theorem 4.31. Let $\mathcal{C}$ be the class of functions $f : \{-1,1\}^n \to \{-1,1\}$ computable
by depth-d poly(n)-size circuits. Then $\mathcal{C}$ can be learned from random
examples with any error $\epsilon = 1/\mathrm{poly}(n)$ in time $n^{O(\log n)^d}$.

In complexity theory the class of poly-size, constant-depth circuits is referred
to as $\mathsf{AC}^0$. Thus the above theorem may be summarized as “$\mathsf{AC}^0$ is learnable
in quasipolynomial time”. In fact, under a strong enough assumption about
the intractability of factoring certain integers, it is known that quasipolynomial
time is required to learn $\mathsf{AC}^0$ circuits, even with query access (Kharitonov,
1993).

The original motivation of the line of work leading to Håstad’s Switching
Lemma was to show that the parity function $\chi_{[n]}$ cannot be computed in $\mathsf{AC}^0$.
Håstad even showed that $\mathsf{AC}^0$ cannot even approximately compute parity. We
can derive this result from the LMN Theorem:


Corollary 4.32. Fix any constant $\epsilon_0 > 0$. Suppose C is a depth-d circuit over
$\{-1,1\}^n$ with $\Pr_x[C(x) = \chi_{[n]}(x)] \ge 1/2 + \epsilon_0$. Then the size of C is at least
$2^{\Omega(n^{1/(d-1)})}$.


Proof. The hypothesis on C implies $\widehat{C}([n]) \ge 2\epsilon_0$. The result then follows by
taking $\epsilon = 2\epsilon_0^2$ in the LMN Theorem.


This corollary is close to being tight, since the parity $\chi_{[n]}$ can be computed by
a depth-d circuit of size $n 2^{n^{1/(d-1)}}$ for any $d \ge 2$; see Exercise 4.12. The simpler
result Theorem 4.30 is often handier for showing that certain functions can’t
be computed by $\mathsf{AC}^0$ circuits. For example, we know that $\mathbf{I}[\mathrm{Maj}_n] = \Theta(\sqrt{n})$;
hence any constant-depth circuit computing $\mathrm{Maj}_n$ must have size at least $2^{n^{\Omega(1)}}$.

Finally, Linial, Mansour, and Nisan gave an application to cryptography.
Informally, a function $f : \{-1,1\}^m \times \{-1,1\}^n \to \{-1,1\}$ is said to be a
“pseudorandom function generator with seed length m” if, for any efficient
algorithm A,


$\left| \Pr_{s \sim \{-1,1\}^m}[A(f(s, \cdot)) = \text{“accept”}] - \Pr_{g \sim \{-1,1\}^{\{-1,1\}^n}}[A(g) = \text{“accept”}] \right| \le 1/n^{\omega(1)}$.


Here the notation A(h) means that A has query access to target function h, and
$g \sim \{-1,1\}^{\{-1,1\}^n}$ means that g is a uniformly random n-bit function. In other
words, for almost all “seeds” s the function $f(s, \cdot) : \{-1,1\}^n \to \{-1,1\}$ is
nearly indistinguishable (to efficient algorithms) from a truly random function.
Theorem 4.30 shows that pseudorandom function generators cannot be computed
by $\mathsf{AC}^0$ circuits. To see this, consider the algorithm A(h) which chooses
$x \sim \{-1,1\}^n$ and $i \in [n]$ uniformly at random, queries h(x) and $h(x^{\oplus i})$, and
accepts if these values are unequal. If h is a uniformly random function, A(h)
will accept with probability 1/2. In general, A(h) accepts with probability
$\mathbf{I}[h]/n$. Thus Theorem 4.30 implies that if h is computable in $\mathsf{AC}^0$ then A(h)
accepts with probability at most $\mathrm{polylog}(n)/n \ll 1/2$.
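The distinguisher A(h) described above is simple enough to sketch directly. In the following, the function names, the use of a seeded `random.Random` for reproducibility, and the three sample targets are assumptions made for illustration; the test itself (pick x and i uniformly, compare h(x) with $h(x^{\oplus i})$) is the one from the text.

```python
import random

def influence_test(h, n, rng):
    """One run of A(h): pick x and i uniformly; accept iff h(x) != h(x with coordinate i negated)."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    i = rng.randrange(n)
    y = list(x)
    y[i] = -y[i]
    return h(tuple(x)) != h(tuple(y))

def acceptance_rate(h, n, trials, seed=0):
    """Empirical acceptance probability of A(h), which estimates I[h]/n."""
    rng = random.Random(seed)
    return sum(influence_test(h, n, rng) for _ in range(trials)) / trials

n = 10
parity = lambda x: 1 if sum(v == -1 for v in x) % 2 == 0 else -1  # I[h] = n
dictator = lambda x: x[0]                                          # I[h] = 1
constant = lambda x: 1                                             # I[h] = 0
rates = {
    "parity": acceptance_rate(parity, n, 1000),
    "dictator": acceptance_rate(dictator, n, 20000),
    "constant": acceptance_rate(constant, n, 1000),
}
```

Parity is accepted on every run and a constant function never is, while the dictator is accepted at a rate near $1/n$; a function with total influence polylog(n) would similarly be accepted at a rate far below the 1/2 expected for a random function.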


4.6. Exercises and Notes 


4.1 Show that every function $f : \{0,1\}^n \to \{0,1\}$ can be represented by a
DNF formula of size at most $2^n$ and width at most n.


4.2 Suppose we have a certain CNF computing $f : \{0,1\}^n \to \{0,1\}$. Switch
ANDs with ORs in the CNF. Show that the result is a DNF computing
the Boolean dual $f^\dagger : \{0,1\}^n \to \{0,1\}$.

4.3 A DNF formula is said to be monotone if its terms contain only unnegated 
variables. Show that monotone DNFs compute monotone functions and 
that any monotone function can be computed by a monotone DNF, but 
that a nonmonotone DNF may compute a monotone function.

4.4 Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a DNF of size s.

(a) Show there exists $S \subseteq [n]$ with $|S| \le \log(s) + O(1)$ and
$|\widehat{f}(S)| \ge \Omega(1/s)$. (Hint: Use Proposition 4.9 and Exercise 3.30.)

(b) Let $\mathcal{C}$ be the concept class of functions $\{-1,1\}^n \to \{-1,1\}$ computable
by DNF formulas of size at most s. Show that $\mathcal{C}$ is learnable
using queries with error $\frac{1}{2} - \Omega(1/s)$ in time poly(n, s). (Such a result,
with error bounded away from $\frac{1}{2}$, is called weak learning.)

4.5 Verify Proposition 4.12. 
4.6 Verify Proposition 4.14. 


4.7 For each n that is an input length for $\mathrm{Tribes}_n$, show that there exists a
function $f : \{-1,1\}^n \to \{-1,1\}$ that is truly unbiased ($\mathbf{E}[f] = 0$) and
has $\mathbf{Inf}_i[f] \le O\left(\frac{\log n}{n}\right)$ for all $i \in [n]$.

4.8 Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is computed by a read-once DNF
(meaning no variable is involved in more than one term) in which
all terms have width exactly w. Compute $\hat{\lVert} f \hat{\rVert}_1$ exactly. Deduce that
$\hat{\lVert} \mathrm{Tribes}_n \hat{\rVert}_1 = 2^{\frac{n}{\log n}(1 \pm o(1))}$ and that there are n-variable width-2 DNFs with
Fourier 1-norm $\Omega(\sqrt{3/2}^{\,n})$.

4.9 Give a direct (Fourier-free) proof of Corollary 4.18. (Hint: Condition on 
whether i € J.) 

4.10 Tighten the constant factor on log s in Theorem 4.20 as much as you
can (avenues of improvement include the argument in Lemma 4.19, the
choice of $\delta$, and Exercise 4.17).

4.11 Prove Lemma 4.23. 

4.12 (a) Show that the parity function $\chi_{[n]} : \{-1,1\}^n \to \{-1,1\}$ can be computed
by a DNF (or a CNF) of size $2^{n-1}$.

(b) Show that the bound $2^{n-1}$ above is exactly tight. (Hint: Show that
every term must have width exactly n.)

(c) Show that there is a depth-3 circuit of size $O(n^{1/2}) \cdot 2^{n^{1/2}}$ computing
$\chi_{[n]}$. (Hint: Break up the input into $n^{1/2}$ blocks of size $n^{1/2}$ and
use (a) twice. How can you compress the result from depth 4 to
depth 3?)

(d) More generally, show there is a depth-d circuit of size $O(n^{1-1/(d-1)}) \cdot 2^{n^{1/(d-1)}}$
computing $\chi_{[n]}$.

4.13 In this exercise we define the most standard class of Boolean circuits. A
(De Morgan) circuit C over Boolean variables $x_1, \ldots, x_n$ is a directed
acyclic graph in which each node (“gate”) is labeled with either an $x_i$ or
with $\wedge$, $\vee$, or $\neg$ (logical NOT). Each $x_i$ is used as label exactly once;
the associated nodes are called “input” gates and must have in-degree 0.
Each $\wedge$ and $\vee$ node must have in-degree 2, and each $\neg$ node must have
in-degree 1. Each node “computes” a Boolean function of the inputs as
in Definition 4.26. Finally, one node of C is designated as the “output”
gate, and C itself is said to compute the function computed by the output
node. For this type of circuit we define its size, denoted size(C), to be the
number of nodes.

Show that each of the following n-input functions can be computed by 
De Morgan circuits of size O(n): 

(a) The logical AND function. 

(b) The parity function. 

(c) The complete quadratic function from Exercise 1.1. 

4.14 Show that computing $\mathrm{Tribes}_{w,s}$ by a CNF formula requires size at least $w^s$.

4.15 Show that there is a universal constant $\epsilon_0 > 0$ such that the following
holds: Every $\frac{3}{4}n$-junta $g : \{-1,1\}^n \to \{-1,1\}$ is $\epsilon_0$-far from $\mathrm{Tribes}_n$
(assuming $n > 1$). (Hint: Letting J denote the coordinates on which g
depends, show that if J has non-full intersection with at least $\frac{1}{4}$ of the
tribes/terms then when $x \sim \{-1,1\}^J$, there is a constant chance that
$\mathbf{Var}[f_{\overline{J} \mid x}] \ge \Omega(1)$.)

4.16 Using the KKL Theorem, show that if $f : \{-1,1\}^n \to \{-1,1\}$ is
a transitive-symmetric function with $\mathbf{Var}[f] \ge \Omega(1)$, then
$\mathbf{I}[f] \ge \Omega(\log n)$.

4.17 Let $f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$ be computable by a CNF C of
width w. In this exercise you will show that $\mathbf{I}[f] \le w$.

Consider the following randomized algorithm that tries to produce an
input $\boldsymbol{x} \in f^{-1}(\text{True})$. First, choose a random permutation $\pi \in S_n$. Then
for $i = 1, \ldots, n$: If the single-literal clause $x_{\pi(i)}$ appears in C, then set
$\boldsymbol{x}_{\pi(i)}$ = True, syntactically simplify C under this setting, and say that
coordinate $\pi(i)$ is “forced”. Similarly, if the single-literal clause $\overline{x}_{\pi(i)}$
appears in C, then set $\boldsymbol{x}_{\pi(i)}$ = False, syntactically simplify C, and say that
coordinate $\pi(i)$ is “forced”. If neither holds, set $\boldsymbol{x}_{\pi(i)}$ uniformly at random.
If C ever contains two single-literal clauses $x_j$ and $\overline{x}_j$, the algorithm
“gives up” and outputs $\boldsymbol{x} = \bot$.

(a) Show that if $\boldsymbol{x} \ne \bot$, then $f(\boldsymbol{x})$ = True.

(b) For $x \in f^{-1}(\text{True})$ let $p(x) = \Pr[\boldsymbol{x} = x]$. For $j \in [n]$ let $I_j$ be the
indicator random variable for the event that coordinate j is
forced. Show that $p(x) = \mathbf{E}\left[\prod_{j=1}^{n} (1/2)^{1 - I_j}\right]$.

(c) Deduce $2^n p(x) \ge 2^{\sum_{j=1}^{n} \mathbf{E}[I_j]}$.

(d) Show that for every x with f(x) = True, $f(x^{\oplus j})$ = False it holds that
$\mathbf{E}[I_j \mid \boldsymbol{x} = x] \ge 1/w$.

(e) Deduce $\mathbf{I}[f] \le w$.


4.18 Given Boolean variables $x_1, \ldots, x_n$, a “random monotone term of
width $w \in \mathbb{N}^+$” is defined to be the logical AND of $x_{i_1}, \ldots, x_{i_w}$, where
$i_1, \ldots, i_w$ are chosen independently and uniformly at random from [n].
(If the $i_j$’s are not all distinct then the resulting term will in fact have
width strictly less than w.) A “random monotone DNF of width w and
size s” is defined to be the logical OR of s independent random monotone
terms. For this exercise we assume n is a sufficiently large perfect square,
and we let $\phi$ be a random monotone DNF of width $\sqrt{n}$ and size $2^{\sqrt{n}}$.

(a) Fix an input $x \in \{-1,1\}^n$ and define $u = \left(\sum_{i=1}^{n} x_i\right)/\sqrt{n} \in
[-\sqrt{n}, \sqrt{n}]$. Let $V_j$ be the event that the jth term of $\phi$ is made 1
(logical False) by x. Compute $\Pr[V_j]$ and $\Pr[\phi(x) = 1]$, and show
that the latter is at least $10^{-9}$ assuming $|u| \le 2$.

(b) Let $U_j$ be the event that the jth term of $\phi$ has exactly one 1 on input x.
Show that $\Pr[U_j \mid V_j] \ge \Omega(w 2^{-w})$ assuming $|u| \le 2$.

(c) Suppose we condition on $\phi(x) = 1$; i.e., $\bigcap_j V_j$. Argue that the
events $U_j$ are independent. Further, argue that for the $U_j$’s that do
occur, the indices of their uniquely-1 variables are independent and
uniformly random among the 1’s of x.

(d) Show that $\Pr[\mathrm{sens}_\phi(x) \ge c\sqrt{n} \mid \phi(x) = 1] \ge 1 - 10^{-10}$ for $c > 0$ a
sufficiently small constant.

(e) Show that $\Pr_x\left[\left|\left(\sum_{i=1}^{n} x_i\right)/\sqrt{n}\right| \le 2\right] \ge \Omega(1)$.

(f) Deduce that there exists a monotone function $f : \{-1,1\}^n \to \{-1,1\}$
with the property that $\Pr_x[\mathrm{sens}_f(x) \ge c'\sqrt{n}] \ge c'$ for some universal
constant $c' > 0$.

(g) Both $\mathrm{Maj}_n$ and the function f from the previous exercise have average
sensitivity $\Theta(\sqrt{n})$. Contrast the “way” in which this occurs for the
two functions.

4.19 In this exercise you will prove the Baby Switching Lemma with constant 3
in place of 5. Let $\phi = T_1 \vee T_2 \vee \cdots \vee T_s$ be a DNF of width $w \ge 1$ over
variables $x_1, \ldots, x_n$. We may assume $\delta < 1/3$, else the theorem is trivial.

(a) Suppose $R = (J \mid z)$ is a “bad” restriction, meaning that $\phi_{J \mid z}$ is not
a constant function. Let i be minimal such that $(T_i)_{J \mid z}$ is neither
constantly True nor constantly False, and let j be minimal such that $x_j$ or
$\overline{x}_j$ appears in this restricted term. Show there is a unique restriction
$R' = (J \setminus \{j\} \mid z')$ extending R that doesn’t falsify $T_i$.

(b) Suppose we enumerate all bad restrictions R, and for each we write
the associated R′ as in (a). Show that no restriction is written more
than w times.

(c) If $(J \mid z)$ is a $\delta$-random restriction and R and R′ are as in (a), show
that $\Pr[(J \mid z) = R] = \frac{2\delta}{1 - \delta} \Pr[(J \mid z) = R']$.

(d) Complete the proof by showing $\Pr[(J \mid z) \text{ is bad}] \le 3\delta w$.


4.20 In this exercise you will prove Theorem 4.30. Say that a “$(d, w, s')$-circuit”
is a depth-$d$ circuit with width at most $w$ and with at most $s'$
nodes at layers 2 through $d$ (i.e., excluding layers 0 and 1).

(a) Show by induction on $d \ge 2$ that any $f : \{-1,1\}^n \to \{-1,1\}$ computable
by a $(d, w, s')$-circuit satisfies $\mathbf{I}[f] \le w \cdot O(\log s')^{d-2}$.
(b) Deduce Theorem 4.30. 


Notes 


Mansour’s Conjecture dates from 1994 (Mansour, 1994). Even the weaker version
would imply that the Kushilevitz-Mansour algorithm learns the class of poly(n)-size
DNF with any constant error, using queries, in time poly(n). In fact, this learning result
was subsequently obtained in a celebrated work of Jackson (Jackson, 1997), using a 
different method (which begins with Exercise 4.4). Nevertheless, the Mansour Conjec- 
ture remains important for learning theory since Gopalan, Kalai, and Klivans (Gopalan 
et al., 2008) have shown that it implies the same learning result in the more challenging 
and realistic model of “agnostic learning”. Theorems 4.24 and 4.25 are also due to 
Mansour (Mansour, 1995). 

The method of random restrictions dates back to Subbotovskaya (Subbotovskaya,
1961). Håstad’s Switching Lemma (Håstad, 1987) and his Lemma 4.28 are the
culmination of a line of work due to Furst, Saxe, and Sipser (Furst et al., 1984), Ajtai (Ajtai,
1983), and Yao (Yao, 1985). Linial, Mansour, and Nisan (Linial et al., 1989, 1993)
proved Lemma 4.21, which allowed them to deduce the LMN Theorem and its
consequences. An additional cryptographic application of the LMN Theorem is found
in Goldmann and Russell (Goldmann and Russell, 2000). The strongest lower bound
currently known for approximately computing parity in $\mathsf{AC}^0$ is due to Impagliazzo,
Matthews, and Paturi (Impagliazzo et al., 2012) and independently to Håstad (Håstad,
2012).

Theorem 4.20 and its generalization Theorem 4.30 are due to Boppana (Boppana,
1997); Linial, Mansour, and Nisan had given the weaker bound $O(\log s)^d$. Exercise 4.17
is due to Amano (Amano, 2011), and Exercise 4.18 is due to Talagrand (Talagrand, 
1996). 


5 
Majority and Threshold Functions 


This chapter is devoted to linear threshold functions, their generalization to 
higher degrees, and their exemplar the majority function. The study of LTFs 
leads naturally to the introduction of the Central Limit Theorem and Gaussian 
random variables — important tools in analysis of Boolean functions. We will 
first use these tools to analyze the Fourier spectrum of the $\mathrm{Maj}_n$ function,
which in some sense “converges” as $n \to \infty$. We’ll then extend to analyzing
the degree-1 Fourier weight, noise stability, and total influence of general linear 
threshold functions. 


5.1. Linear Threshold Functions and Polynomial Threshold Functions 


Recall from Chapter 2.1 that a linear threshold function (abbreviated LTF) is a
Boolean-valued function $f : \{-1,1\}^n \to \{-1,1\}$ that can be represented as

\[ f(x) = \operatorname{sgn}(a_0 + a_1 x_1 + \cdots + a_n x_n) \tag{5.1} \]

for some constants $a_0, a_1, \dots, a_n \in \mathbb{R}$. (For definiteness we’ll take $\operatorname{sgn}(0) = 1$.
If we’re using the representation $f : \{-1,1\}^n \to \{0,1\}$, then $f$ is an LTF if it
can be represented as $f(x) = 1_{\{a_0 + a_1 x_1 + \cdots + a_n x_n > 0\}}$.) Examples include majority,
AND, OR, dictators, and decision lists (Exercise 3.23). Besides representing 
“weighted majority” voting schemes, LTFs play an important role in learning 
theory and in circuit complexity. 

There is also a geometric perspective on LTFs. Writing $\ell(x) = a_0 + a_1 x_1 +
\cdots + a_n x_n$, we can think of $\ell$ as an affine function $\mathbb{R}^n \to \mathbb{R}$. Then $\operatorname{sgn}(\ell(x))$ is
the $\pm 1$-indicator of a halfspace in $\mathbb{R}^n$. A Boolean LTF is thus the restriction
of such a halfspace-indicator to the discrete cube $\{-1,1\}^n \subset \mathbb{R}^n$. Equivalently,
a function $f : \{-1,1\}^n \to \{-1,1\}$ is an LTF if and only if it has a “linear
separator”; i.e., a hyperplane in $\mathbb{R}^n$ that separates the points $f$ labels 1 from the
points $f$ labels $-1$.

An LTF $f : \{-1,1\}^n \to \{-1,1\}$ can have several different representations
as in (5.1); in fact it always has infinitely many. This is clear from the geometric
viewpoint; any small enough perturbation to a linear separator will not change
the way it partitions the discrete cube. Because we can make these perturbations,
we may ensure that $a_0 + a_1 x_1 + \cdots + a_n x_n \ne 0$ for every $x \in \{-1,1\}^n$. We’ll
usually insist that LTF representations have this property so that the nuisance
of $\operatorname{sgn}(0)$ doesn’t arise. We also observe that we can scale all of the coefficients
in an LTF representation by the same positive constant without changing the
LTF. These observations can be used to show it’s always possible to take the
$a_i$’s to be integers (Exercise 5.1). However, we will most often scale so that
$\sum_{i=1}^n a_i^2 = 1$; this is convenient when using the Central Limit Theorem.

The most elegant result connecting LTFs and Fourier expansions is Chow’s 
Theorem, which says that a Boolean LTF is completely determined by its 
degree-0 and degree-1 Fourier coefficients. In fact, it’s determined not just 
within the class of LTFs but within the class of all Boolean functions: 


Theorem 5.1. Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF and let $g : \{-1,1\}^n \to
\{-1,1\}$ be any function. If $\widehat{g}(S) = \widehat{f}(S)$ for all $|S| \le 1$, then $g = f$.


Proof. Let $f(x) = \operatorname{sgn}(\ell(x))$, where $\ell : \{-1,1\}^n \to \mathbb{R}$ has degree at most 1
and is never 0 on $\{-1,1\}^n$. For any $x \in \{-1,1\}^n$ we have $f(x)\ell(x) = |\ell(x)| \ge
g(x)\ell(x)$, with equality if and only if $f(x) = g(x)$ (here we use $\ell(x) \ne 0$).
Using this observation along with Plancherel’s Theorem (twice) we have

\[ \sum_{|S| \le 1} \widehat{f}(S)\widehat{\ell}(S) = \mathbf{E}[f(x)\ell(x)] \ge \mathbf{E}[g(x)\ell(x)] = \sum_{|S| \le 1} \widehat{g}(S)\widehat{\ell}(S). \]

But by assumption, the left-hand and right-hand sides above are equal. Thus the
inequality must be an equality for every value of $x$; i.e., $f(x) = g(x)$ for all $x$.


In light of Chow’s Theorem, the $n + 1$ numbers $\widehat{g}(\emptyset), \widehat{g}(\{1\}), \dots, \widehat{g}(\{n\})$ are
sometimes called the Chow parameters of the Boolean function $g$.
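Chow's Theorem can be checked exhaustively in a tiny case. The following is a minimal sketch, not part of the original text: it enumerates all 256 functions on 3 bits, collects the LTFs that arise from small integer weights (the weight bound 3 and the helper names are illustrative assumptions), and counts how many other functions share an LTF's Chow parameters. The theorem predicts that the count is zero.

```python
from itertools import product

N = 3
CUBE = list(product((-1, 1), repeat=N))

def chow_parameters(values):
    """Return 2^N * (g^(empty), g^({1}), ..., g^({N})) as exact integers."""
    const = sum(values)
    degree_one = tuple(sum(v * x[i] for v, x in zip(values, CUBE)) for i in range(N))
    return (const,) + degree_one

def small_weight_ltfs(bound=3):
    """Truth tables of sgn(a0 + a1 x1 + ... + aN xN) with integer weights in [-bound, bound]."""
    tables = set()
    for a in product(range(-bound, bound + 1), repeat=N + 1):
        ell = [a[0] + sum(a[i + 1] * x[i] for i in range(N)) for x in CUBE]
        if all(v != 0 for v in ell):  # insist that ell is never 0 on the cube
            tables.add(tuple(1 if v > 0 else -1 for v in ell))
    return tables

def chow_collisions():
    """Count pairs (LTF f, function g != f) having identical Chow parameters."""
    by_chow = {}
    for g in product((-1, 1), repeat=len(CUBE)):  # all 256 Boolean functions
        by_chow.setdefault(chow_parameters(g), []).append(g)
    return sum(len(by_chow[chow_parameters(f)]) - 1 for f in small_weight_ltfs())
```

By contrast, functions that are not LTFs can collide: the 3-bit parity and its negation both have all Chow parameters equal to 0.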

As we will show in Section 5.5, linear threshold functions are very noise- 
stable; hence they have a lot of their Fourier weight at low degrees. Here is a 
simple result along these lines: 


Theorem 5.2. Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF. Then $\mathbf{W}^{\le 1}[f] \ge 1/2$.

Proof. Writing $f(x) = \operatorname{sgn}(\ell(x))$ we have

\[ \|\ell\|_1 = \mathbf{E}[f(x)\ell(x)] = \langle f, \ell \rangle = \langle f^{\le 1}, \ell \rangle \le \|f^{\le 1}\|_2 \|\ell\|_2 = \sqrt{\mathbf{W}^{\le 1}[f]} \cdot \|\ell\|_2, \]

where the third equality follows from Plancherel and the inequality is Cauchy-Schwarz.
Assume first that $\ell(x) = a_1 x_1 + \cdots + a_n x_n$ (i.e., $\ell(x)$ has no constant
term). The Khintchine-Kahane Inequality (Exercise 2.55) states that $\|\ell\|_1 \ge
\frac{1}{\sqrt{2}} \|\ell\|_2$, and hence we deduce

\[ \tfrac{1}{\sqrt{2}} \|\ell\|_2 \le \sqrt{\mathbf{W}^{\le 1}[f]} \cdot \|\ell\|_2. \]

The conclusion $\mathbf{W}^{\le 1}[f] \ge 1/2$ follows immediately (since $\|\ell\|_2$ cannot be 0).
The case when $\ell(x)$ has a constant term is handled in Exercise 5.5.


From Exercise 2.22 we know that $\mathbf{W}^{\le 1}[\mathrm{Maj}_n] = \mathbf{W}^{1}[\mathrm{Maj}_n] \ge 2/\pi$ for all $n$;
it is reasonable to conjecture that majority is extremal for Theorem 5.2. This is
an open problem.


Conjecture 5.3. Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF. Then $\mathbf{W}^{\le 1}[f] \ge 2/\pi$.
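As a sanity check on Theorem 5.2, the low-degree weight of small LTFs can be computed by direct enumeration. The sketch below is an illustration under stated assumptions: the weight range [-2, 2], the choice n = 3, and all function names are arbitrary choices made here, not anything from the text. The degree-1 coefficient of majority used in `majority_degree_one_weight` is the probability that the other n - 1 bits split evenly.

```python
from itertools import product
from math import comb, pi

def low_degree_weight(f, n):
    """W^{<=1}[f] = f^(empty)^2 + sum_i f^(i)^2, by enumerating the cube."""
    cube = list(product((-1, 1), repeat=n))
    mean = sum(f(x) for x in cube) / len(cube)
    coeffs = [sum(f(x) * x[i] for x in cube) / len(cube) for i in range(n)]
    return mean ** 2 + sum(c ** 2 for c in coeffs)

def ltf(a):
    """The LTF sgn(a[0] + a[1] x_1 + ... + a[n] x_n), with sgn(0) = 1."""
    return lambda x: 1 if a[0] + sum(w * xi for w, xi in zip(a[1:], x)) >= 0 else -1

def min_low_degree_weight(n=3, bound=2):
    """Smallest W^{<=1} over all LTFs with integer weights in [-bound, bound]."""
    return min(low_degree_weight(ltf(a), n)
               for a in product(range(-bound, bound + 1), repeat=n + 1))

def majority_degree_one_weight(n):
    """W^1[Maj_n] for odd n: n times the squared coefficient C(n-1, (n-1)/2) / 2^(n-1)."""
    return n * (comb(n - 1, (n - 1) // 2) / 2 ** (n - 1)) ** 2
```

For example, `majority_degree_one_weight(3)` is 3/4, and the values stay above 2/π ≈ 0.6366 as n grows, consistent with majority being the conjectured extremal case.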


A natural generalization of linear threshold functions is polynomial threshold 
functions: 


Definition 5.4. A function $f : \{-1,1\}^n \to \{-1,1\}$ is called a polynomial
threshold function (PTF) of degree at most $k$ if it is expressible as $f(x) =
\operatorname{sgn}(p(x))$ for some real polynomial $p : \{-1,1\}^n \to \mathbb{R}$ of degree at most $k$.


Example 5.5. Let $f : \{-1,1\}^4 \to \{-1,1\}$ be the 4-bit equality function, which
is 1 if and only if all input bits are equal. Then $f$ is a degree-2 PTF because it
has the representation $f(x) = \operatorname{sgn}(-3 + x_1x_2 + x_1x_3 + x_1x_4 + x_2x_3 + x_2x_4 +
x_3x_4)$.
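The representation in Example 5.5 is easy to confirm over all 16 inputs. The sketch below (function names are illustrative) evaluates the polynomial and compares its sign to the equality function. The polynomial takes only the values 3, -3, and -5 on the cube, so sgn(0) never arises.

```python
from itertools import combinations, product

def equality_polynomial(x):
    """p(x) = -3 + sum of all six pairwise products x_i x_j."""
    return -3 + sum(x[i] * x[j] for i, j in combinations(range(4), 2))

def equality_ptf(x):
    """sgn(p(x)), the degree-2 PTF representation from Example 5.5."""
    return 1 if equality_polynomial(x) >= 0 else -1

def equality(x):
    """1 if all four input bits are equal, and -1 otherwise."""
    return 1 if len(set(x)) == 1 else -1
```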


Every Boolean function $f : \{-1,1\}^n \to \{-1,1\}$ is a PTF of degree at
most $n$, since we can take the sign of its Fourier expansion. Thus we are
usually interested in the case when the degree $k$ is “small”, say, $k = O_n(1)$.
Low-degree PTFs arise frequently in learning theory, for example, as hypotheses
in the Low-Degree Algorithm and many other practical learning algorithms.
Indeed, any function with low noise sensitivity is close to being a low-degree
PTF; by combining Propositions 3.3 and 3.31 we immediately obtain:


Proposition 5.6. Let $f : \{-1,1\}^n \to \{-1,1\}$ and let $\delta \in (0, 1/2]$. Then $f$ is
$(3\mathbf{NS}_\delta[f])$-close to a PTF of degree $1/\delta$.


For a kind of converse to this proposition, see Section 5.5. 
PTFs also arise in circuit complexity, wherein a PTF representation

\[ f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{s} a_i x^{T_i}\Bigr) \]

is thought of as a “threshold-of-parities circuit”: i.e., a depth-2 circuit with $s$
“parity gates” $x^{T_i}$ at layer 1 and a single “(linear) threshold gate” at layer 2.
From this point of view, the size of the circuit corresponds to the sparsity of
the PTF representation:


Definition 5.7. We say a PTF representation f(x) = sgn(p(x)) has sparsity at 
most s if p(x) is a multilinear polynomial with at most s terms. 


For example, the PTF representation of the 4-bit equality function from Exam- 
ple 5.5 has sparsity 7. 

Let’s extend the two theorems about LTFs we proved above to the case of 
PTFs. The generalization of Chow’s Theorem is straightforward; its proof is 
left as Exercise 5.9: 


Theorem 5.8. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a PTF of degree at most $k$ and
let $g : \{-1,1\}^n \to \{-1,1\}$ be any function. If $\widehat{g}(S) = \widehat{f}(S)$ for all $|S| \le k$, then
$g = f$.


We also have the following extension of Theorem 5.2:


Theorem 5.9. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a degree-$k$ PTF. Then $\mathbf{W}^{\le k}[f] \ge
e^{-2k}$.


Proof. Writing $f(x) = \operatorname{sgn}(p(x))$ for $p$ of degree $k$, we again have

\[ \|p\|_1 = \mathbf{E}[f(x)p(x)] = \langle f, p \rangle = \langle f^{\le k}, p \rangle \le \|f^{\le k}\|_2 \|p\|_2 = \sqrt{\mathbf{W}^{\le k}[f]} \cdot \|p\|_2. \]

To complete the proof we need the fact that $\|p\|_2 \le e^{k}\|p\|_1$ for any degree-$k$
polynomial $p : \{-1,1\}^n \to \mathbb{R}$. We will prove this much later in Theorem 9.22
of Chapter 9 on hypercontractivity.


The $e^{-2k}$ in this theorem cannot be improved beyond $2^{1-k}$; see Exercise 5.11.
We close this section by discussing PTF sparsity. We begin with a (simpler) 
variant of Theorem 5.9, which is useful for proving PTF sparsity lower bounds: 


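Theorem 5.10. Let $f : \{-1,1\}^n \to \{-1,1\}$ be expressible as a PTF over the
collection of monomials $\mathcal{F} \subseteq 2^{[n]}$; i.e., $f(x) = \operatorname{sgn}(p(x))$ for some polynomial
$p(x) = \sum_{S \in \mathcal{F}} \widehat{p}(S) x^S$. Then $\sum_{S \in \mathcal{F}} |\widehat{f}(S)| \ge 1$.

Proof. Define $g : \{-1,1\}^n \to \mathbb{R}$ by $g(x) = \sum_{S \in \mathcal{F}} \widehat{f}(S) x^S$. Since $\hat{\lVert} p \hat{\rVert}_\infty \le
\|p\|_1$ (Exercise 3.9) we have

\[ \hat{\lVert} p \hat{\rVert}_\infty \le \|p\|_1 = \mathbf{E}[f(x)p(x)] = \sum_{S \subseteq [n]} \widehat{f}(S)\widehat{p}(S) = \sum_{S \in \mathcal{F}} \widehat{g}(S)\widehat{p}(S) \le \hat{\lVert} g \hat{\rVert}_1 \hat{\lVert} p \hat{\rVert}_\infty, \]

and hence $\hat{\lVert} g \hat{\rVert}_1 \ge 1$ as claimed.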
We can use this result to show that the “inner product mod 2 function” (see
Exercise 1.1) requires huge threshold-of-parities circuits:


Corollary 5.11. Any PTF representation of the inner product mod 2 function
$\mathrm{IP}_{2n} : \mathbb{F}_2^{2n} \to \{-1,1\}$ has sparsity at least $2^n$.


Proof. This follows immediately from Theorem 5.10 and the fact that
$|\widehat{\mathrm{IP}_{2n}}(S)| = 2^{-n}$ for all $S \subseteq [2n]$ (Exercise 1.1).
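The fact used in this proof can be confirmed by brute force for small n. A minimal sketch follows, assuming the convention IP_{2n}(x, y) = (-1)^{x·y} for x, y in F_2^n; the function name and the indexing of characters by vectors gamma are choices made here for illustration.

```python
from itertools import product

def ip_fourier_coefficients(n):
    """All Fourier coefficients of IP_{2n}(x, y) = (-1)^{x . y}, with x, y in F_2^n."""
    points = list(product((0, 1), repeat=2 * n))

    def ip(z):
        return (-1) ** sum(z[i] * z[n + i] for i in range(n))

    coeffs = {}
    for gamma in points:  # gamma indexes the character chi_gamma(z) = (-1)^{gamma . z}
        total = sum(ip(z) * (-1) ** sum(g * zi for g, zi in zip(gamma, z)) for z in points)
        coeffs[gamma] = total / len(points)
    return coeffs
```

Since every coefficient has magnitude 2^{-n}, any family of monomials whose coefficient magnitudes sum to at least 1 must contain at least 2^n monomials, which is exactly how Theorem 5.10 yields the sparsity bound.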


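We can also show that any function $f : \{-1,1\}^n \to \{-1,1\}$ with small
Fourier 1-norm $\hat{\lVert} f \hat{\rVert}_1$ has a sparse PTF representation. In fact a stronger result
holds: such a function can be additively approximated by a sparse polynomial:


Theorem 5.12. Let $f : \{-1,1\}^n \to \mathbb{R}$ be nonzero, let $\delta > 0$, and let $s \ge
4n\hat{\lVert} f \hat{\rVert}_1^2/\delta^2$ be an integer. Then there is a multilinear polynomial $q :
\{-1,1\}^n \to \mathbb{R}$ of sparsity at most $s$ such that $\|f - q\|_\infty < \delta$.


Proof. The proof is by the probabilistic method. Let $\mathbf{T} \subseteq [n]$ be randomly
chosen according to the distribution $\Pr[\mathbf{T} = T] = |\widehat{f}(T)| / \hat{\lVert} f \hat{\rVert}_1$. Let $\mathbf{T}_1, \dots, \mathbf{T}_s$ be
independent draws from this distribution and define the multilinear polynomial

\[ \mathbf{p}(x) = \sum_{i=1}^{s} \operatorname{sgn}(\widehat{f}(\mathbf{T}_i))\, x^{\mathbf{T}_i}. \]

When $x \in \{-1,1\}^n$ is fixed, each monomial $\operatorname{sgn}(\widehat{f}(\mathbf{T}_i))\, x^{\mathbf{T}_i}$ becomes a $\pm 1$-valued
random variable with expectation

\[ \sum_{T \subseteq [n]} \frac{|\widehat{f}(T)|}{\hat{\lVert} f \hat{\rVert}_1} \cdot \operatorname{sgn}(\widehat{f}(T))\, x^{T} = \frac{1}{\hat{\lVert} f \hat{\rVert}_1} \sum_{T \subseteq [n]} \widehat{f}(T)\, x^{T} = \frac{f(x)}{\hat{\lVert} f \hat{\rVert}_1}. \]

Thus by a Chernoff bound, for any $\epsilon > 0$,

\[ \Pr_{\mathbf{T}_1, \dots, \mathbf{T}_s}\Bigl[\Bigl|\mathbf{p}(x) - \frac{f(x)}{\hat{\lVert} f \hat{\rVert}_1}\, s\Bigr| \ge \epsilon s\Bigr] \le 2\exp(-\epsilon^2 s/2). \]

Selecting $\epsilon = \delta/\hat{\lVert} f \hat{\rVert}_1$ and using $s \ge 4n\hat{\lVert} f \hat{\rVert}_1^2/\delta^2$, the probability is at most
$2\exp(-2n) < 2^{-n}$. Taking a union bound over all $2^n$ choices of $x \in \{-1,1\}^n$,
we conclude that there exists some $p(x) = \sum_{i=1}^{s} \operatorname{sgn}(\widehat{f}(T_i))\, x^{T_i}$ such that for
all $x \in \{-1,1\}^n$,

\[ \Bigl|p(x) - \frac{f(x)}{\hat{\lVert} f \hat{\rVert}_1}\, s\Bigr| < \epsilon s = \frac{\delta}{\hat{\lVert} f \hat{\rVert}_1}\, s \quad\Longrightarrow\quad \Bigl|\frac{\hat{\lVert} f \hat{\rVert}_1}{s} \cdot p(x) - f(x)\Bigr| < \delta. \]

Thus we may take $q = \frac{\hat{\lVert} f \hat{\rVert}_1}{s} \cdot p$.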


Corollary 5.13. Let $f : \{-1,1\}^n \to \{-1,1\}$. Then $f$ is expressible as a PTF
of sparsity at most $s = \lceil 4n\hat{\lVert} f \hat{\rVert}_1^2 \rceil$. Indeed, $f$ can be represented as a majority
of $s$ parities or negated-parities.

Proof. Apply the previous theorem with $\delta = 1$; we then have $f(x) = \operatorname{sgn}(q(x))$.
Since this is also equivalent to $\operatorname{sgn}(p(x))$, the terms $\operatorname{sgn}(\widehat{f}(T_i))\, x^{T_i}$ are the
required parities/negated-parities.
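The probabilistic construction behind Theorem 5.12 and Corollary 5.13 can be run directly on a small function. The sketch below is an illustration only: the retry loop, the seed, and every function name are assumptions introduced here. It draws s = ⌈4n‖f‖₁²⌉ signed parities with probability proportional to the magnitude of each Fourier coefficient and resamples until the sum sign-represents f, which the proof shows happens with positive probability on each attempt.

```python
import random
from itertools import product
from math import ceil, prod

def fourier_expansion(f, n):
    """Map each indicator vector T of a subset of [n] to the coefficient f^(T)."""
    cube = list(product((-1, 1), repeat=n))
    return {T: sum(f(x) * prod(xi for xi, t in zip(x, T) if t) for x in cube) / len(cube)
            for T in product((0, 1), repeat=n)}

def sample_parities(f, n, rng):
    """Draw s = ceil(4 n ||f||_1^2) signed parities, Pr[T] proportional to |f^(T)|."""
    coeffs = fourier_expansion(f, n)
    support = [T for T, c in coeffs.items() if c != 0]
    weights = [abs(coeffs[T]) for T in support]
    s = ceil(4 * n * sum(weights) ** 2)
    draws = rng.choices(support, weights=weights, k=s)
    return [(1 if coeffs[T] > 0 else -1, T) for T in draws]

def evaluate(terms, x):
    """Value at x of the sum of signed parities sign * x^T."""
    return sum(sign * prod(xi for xi, t in zip(x, T) if t) for sign, T in terms)

def majority_of_parities(f, n, seed=0, max_tries=1000):
    """Resample until sgn(sum of signed parities) equals f everywhere on the cube."""
    rng = random.Random(seed)
    cube = list(product((-1, 1), repeat=n))
    for _ in range(max_tries):
        terms = sample_parities(f, n, rng)
        if all(evaluate(terms, x) * f(x) > 0 for x in cube):
            return terms
    raise RuntimeError("no sign-representation found; increase max_tries")
```

For Maj_3, whose four nonzero coefficients all have magnitude 1/2, the Fourier 1-norm is 2 and the construction uses s = 48 parities.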


Though functions computable by small DNFs need not have small Fourier
1-norm, it is a further easy corollary that they can be computed by sparse
PTFs: see Exercise 5.13. We also remark that there is no good converse to
Corollary 5.13: the $\mathrm{Maj}_n$ function has a PTF (indeed, an LTF) of sparsity $n$ but
has exponentially large Fourier 1-norm (Exercise 5.26).


5.2. Majority, and the Central Limit Theorem 


Majority is one of the more important functions in Boolean analysis, and its 
study motivates the introduction of one of the more important tools: the Central 
Limit Theorem (CLT). In this section we will show how the CLT can be used to 
estimate the total influence and the noise stability of $\mathrm{Maj}_n$. Though we already
determined $\mathbf{I}[\mathrm{Maj}_n] \sim \sqrt{2/\pi}\sqrt{n}$ in Exercise 2.22 using binomial coefficients
and Stirling’s Formula, computations using the CLT are more flexible and
extend to other linear threshold functions.

We begin with a reminder about the CLT. Suppose $X_1, \dots, X_n$ are independent
random variables and $S = X_1 + \cdots + X_n$. Roughly speaking, the CLT
says that so long as no X; is too dominant in terms of variance, the distribution 
of S is close to that of a Gaussian random variable with the same mean and 
variance. Recall: 


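Notation 5.14. We write $Z \sim \mathrm{N}(0, 1)$ to denote that $Z$ is a standard Gaussian
random variable. We use the notation

\[ \varphi(z) = \tfrac{1}{\sqrt{2\pi}} e^{-z^2/2}, \qquad \Phi(t) = \int_{-\infty}^{t} \varphi(z)\,dz, \qquad \overline{\Phi}(t) = \Phi(-t) = \int_{t}^{\infty} \varphi(z)\,dz \]

for the pdf, cdf, and complementary cdf of this random variable. More generally,
if $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$ is a positive semidefinite matrix, we write
$Z \sim \mathrm{N}(\mu, \Sigma)$ to denote that $Z$ is a $d$-dimensional random vector with mean $\mu$
and covariance matrix $\Sigma$.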


We give a precise statement of the CLT below in the form of the Berry- 
Esseen Theorem. The CLT also extends to the multidimensional case (sums of 
independent random vectors); we give a precise statement in Exercise 5.33. In 
Chapter 11 we will show one way to prove such CLTs. 
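Let’s see how we can use the CLT to obtain the estimate $\mathbf{I}[\mathrm{Maj}_n] \sim
\sqrt{2/\pi}\sqrt{n}$. Recall the proof of Theorem 2.33, which shows that $\mathrm{Maj}_n$ maximizes
$\sum_{i=1}^n \widehat{f}(i)$ among all $f : \{-1,1\}^n \to \{-1,1\}$. In it we saw that

\[ \mathbf{I}[\mathrm{Maj}_n] = \sum_{i=1}^{n} \widehat{\mathrm{Maj}_n}(i) = \mathbf{E}_x[\mathrm{Maj}_n(x)(x_1 + \cdots + x_n)] = \mathbf{E}_x\Bigl[\Bigl|\sum_i x_i\Bigr|\Bigr]. \tag{5.2} \]

When using the CLT, it’s convenient to define majority (equivalently) as
$\mathrm{Maj}_n(x) = \operatorname{sgn}\bigl(\sum_{i=1}^n \tfrac{1}{\sqrt{n}} x_i\bigr)$.
This motivates writing (5.2) as

\[ \mathbf{I}[\mathrm{Maj}_n] = \sqrt{n} \cdot \mathbf{E}_{x \sim \{-1,1\}^n}\Bigl[\Bigl|\sum_i \tfrac{1}{\sqrt{n}} x_i\Bigr|\Bigr]. \tag{5.3} \]

If we introduce $S = \sum_{i=1}^n \tfrac{1}{\sqrt{n}} x_i$, then $S$ has mean 0 and variance $\sum_i (1/\sqrt{n})^2 =
1$. Thus the CLT tells us that the distribution of $S$ is close (for large $n$) to that
of a standard Gaussian, $Z \sim \mathrm{N}(0, 1)$. So as $n \to \infty$ we have

\[ \mathbf{E}_x[|S|] \sim \mathbf{E}_{Z \sim \mathrm{N}(0,1)}[|Z|] = 2\int_0^\infty z \cdot \tfrac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = -\sqrt{2/\pi}\, e^{-z^2/2}\Bigr|_0^\infty = \sqrt{2/\pi}, \tag{5.4} \]

which when combined with (5.3) gives us the estimate $\mathbf{I}[\mathrm{Maj}_n] \sim \sqrt{2/\pi}\sqrt{n}$.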
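The estimate can be compared against the exact value, since by (5.2) the total influence of majority is the mean absolute value of a sum of n uniform ±1 bits. A minimal sketch (the function name is an illustrative choice):

```python
from math import comb, pi, sqrt

def total_influence_majority(n):
    """I[Maj_n] = E|x_1 + ... + x_n| for odd n, computed exactly from binomial counts."""
    # k counts the +1 coordinates, so the sum of the bits equals 2k - n.
    return sum(comb(n, k) * abs(2 * k - n) for k in range(n + 1)) / 2 ** n

def ratio_to_clt_estimate(n):
    """Exact total influence divided by the CLT estimate sqrt(2/pi) * sqrt(n)."""
    return total_influence_majority(n) / sqrt(2 * n / pi)
```

The ratio tends to 1 as n grows; it is already within 1% at n = 101.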

To make this kind of estimate more precise we state the Berry—Esseen 
Theorem, which is a strong version of the CLT giving explicit error bounds 
rather than just limiting statements. 


Berry-Esseen (Central Limit) Theorem. Let $X_1, \dots, X_n$ be independent
random variables with $\mathbf{E}[X_i] = 0$ and $\mathbf{Var}[X_i] = \sigma_i^2$, and assume $\sum_{i=1}^n \sigma_i^2 = 1$.
Let $S = \sum_{i=1}^n X_i$ and let $Z \sim \mathrm{N}(0, 1)$ be a standard Gaussian. Then for all
$u \in \mathbb{R}$,

\[ |\Pr[S \le u] - \Pr[Z \le u]| \le c\gamma, \]

where

\[ \gamma = \sum_{i=1}^{n} \|X_i\|_3^3 \]

and $c$ is a universal constant. (For definiteness, $c = .56$ is acceptable.)
Remark 5.15. If all of the $X_i$’s satisfy $|X_i| \le \epsilon$ with probability 1, then we
can use the bound

\[ \gamma = \sum_{i=1}^{n} \mathbf{E}[|X_i|^3] \le \epsilon \cdot \sum_{i=1}^{n} \mathbf{E}[|X_i|^2] = \epsilon \cdot \sum_{i=1}^{n} \sigma_i^2 = \epsilon. \]

See Exercises 5.16 and 5.17 for some additional observations.
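For the normalized sum of n uniform ±1 bits, each summand has magnitude 1/√n, so the bound above gives γ ≤ 1/√n and the Berry-Esseen Theorem predicts a cdf distance of at most .56/√n. The sketch below computes that distance exactly; the function name is an illustrative choice, and the Gaussian cdf comes from the Python standard library. Because the cdf of S is a step function and the Gaussian cdf is continuous, the supremum is attained at an atom of S, either at it or just to its left.

```python
from math import comb, sqrt
from statistics import NormalDist

def kolmogorov_distance_to_gaussian(n):
    """sup_u |Pr[S <= u] - Pr[Z <= u]| for S = (x_1 + ... + x_n)/sqrt(n), x uniform."""
    gaussian_cdf = NormalDist().cdf
    worst, cdf = 0.0, 0.0
    for k in range(n + 1):  # k = number of +1 coordinates
        u = (2 * k - n) / sqrt(n)  # an atom of S
        worst = max(worst, abs(cdf - gaussian_cdf(u)))  # just to the left of the atom
        cdf += comb(n, k) / 2 ** n
        worst = max(worst, abs(cdf - gaussian_cdf(u)))  # at the atom itself
    return worst
```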
Our most frequent use of the Berry-Esseen Theorem will be in analyzing
random sums

\[ S = \sum_{i=1}^{n} a_i x_i, \]

where $x \sim \{-1,1\}^n$ and the constants $a_i \in \mathbb{R}$ are normalized so that $\sum_i a_i^2 = 1$.
For majority, all of the $a_i$’s were equal to $\frac{1}{\sqrt{n}}$. But from Remark 5.15 we see
that $S$ is close in distribution to a standard Gaussian so long as each $|a_i|$ is
small. For example, in Exercise 5.31 you are asked to show the following:


Theorem 5.16. Let $a_1, \dots, a_n \in \mathbb{R}$ satisfy $\sum_i a_i^2 = 1$ and $|a_i| \le \epsilon$ for all $i$.
Then

\[ \Bigl|\mathbf{E}_{x \sim \{-1,1\}^n}\Bigl[\Bigl|\sum_i a_i x_i\Bigr|\Bigr] - \sqrt{2/\pi}\Bigr| \le C\epsilon, \]

where $C$ is a universal constant.
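Theorem 5.16 can be explored numerically for small n by enumerating the cube. In this sketch the helper names and the example weight vectors are illustrative assumptions. Equal weights give a value near √(2/π) ≈ 0.798, while a single dominant weight (a dictator) gives exactly 1.

```python
from itertools import product
from math import pi, sqrt

def mean_abs_sum(a):
    """E|a_1 x_1 + ... + a_n x_n| over uniform x in {-1,1}^n, by enumeration."""
    n = len(a)
    return sum(abs(sum(ai * xi for ai, xi in zip(a, x)))
               for x in product((-1, 1), repeat=n)) / 2 ** n

def normalized(weights):
    """Rescale the weights so that their squares sum to 1."""
    norm = sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]
```

For any unit-norm weights the value lies between 1/√2 (Khintchine-Kahane) and 1 (Cauchy-Schwarz), so the content of the theorem is that small ε pins it near √(2/π).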


Theorem 5.16 justifies (5.4) with an error bound of $O(1/\sqrt{n})$, yielding the
more precise estimate $\mathbf{I}[\mathrm{Maj}_n] = \sqrt{2/\pi}\sqrt{n} \pm O(1)$ (cf. Exercise 2.22, which
gives an even better error bound).

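Now let’s turn to the noise stability of majority. Theorem 2.45 stated the
formula

\[ \lim_{n \to \infty} \mathbf{Stab}_\rho[\mathrm{Maj}_n] = \tfrac{2}{\pi} \arcsin \rho = 1 - \tfrac{2}{\pi} \arccos \rho. \tag{5.5} \]

Let’s now spend some time justifying this using the multidimensional CLT.
(For complete details, see Exercise 5.33.) By definition,

\[ \mathbf{Stab}_\rho[\mathrm{Maj}_n] = \mathbf{E}_{(x,y)\ \rho\text{-correlated}}[\mathrm{Maj}_n(x) \cdot \mathrm{Maj}_n(y)] = \mathbf{E}\Bigl[\operatorname{sgn}\Bigl(\sum_i \tfrac{1}{\sqrt{n}} x_i\Bigr) \cdot \operatorname{sgn}\Bigl(\sum_i \tfrac{1}{\sqrt{n}} y_i\Bigr)\Bigr]. \tag{5.6} \]

For each $i \in [n]$ let’s stack $\tfrac{1}{\sqrt{n}} x_i$ and $\tfrac{1}{\sqrt{n}} y_i$ into a 2-dimensional vector and
then write

\[ S = \sum_{i=1}^{n} \begin{bmatrix} \tfrac{1}{\sqrt{n}} x_i \\ \tfrac{1}{\sqrt{n}} y_i \end{bmatrix} \in \mathbb{R}^2. \tag{5.7} \]

We are summing $n$ independent random vectors, so the multidimensional CLT
tells us that the distribution of $S$ is close to that of a 2-dimensional Gaussian $Z$
with the same mean and covariance matrix, namely (see Exercise 5.19)

\[ Z \sim \mathrm{N}\Bigl(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\Bigr). \]

Continuing from (5.6),

\[ \begin{aligned} \mathbf{Stab}_\rho[\mathrm{Maj}_n] &= \mathbf{E}[\operatorname{sgn}(S_1) \cdot \operatorname{sgn}(S_2)] \\ &= \Pr[\operatorname{sgn}(S_1) = \operatorname{sgn}(S_2)] - \Pr[\operatorname{sgn}(S_1) \ne \operatorname{sgn}(S_2)] \\ &= 2\Pr[\operatorname{sgn}(S_1) = \operatorname{sgn}(S_2)] - 1 = 4\Pr[S \in Q_{--}] - 1, \end{aligned} \]

where $Q_{--}$ denotes the lower-left quadrant of $\mathbb{R}^2$ and the last step uses the
symmetry $\Pr[S \in Q_{++}] = \Pr[S \in Q_{--}]$. Since $Q_{--}$ is convex, the 2-dimensional
CLT lets us deduce

\[ \lim_{n \to \infty} \Pr[S \in Q_{--}] = \Pr[Z \in Q_{--}]. \]

So to justify the noise stability formula (5.5) for majority, it remains to verify

\[ 4\Pr[Z \in Q_{--}] - 1 = 1 - \tfrac{2}{\pi} \arccos \rho \quad\Longleftrightarrow\quad \Pr[Z \in Q_{--}] = \tfrac{1}{2} - \tfrac{1}{2} \tfrac{\arccos \rho}{\pi}. \]

And this in turn is a 19th-century identity known as Sheppard’s Formula:


Sheppard’s Formula. Let $z_1, z_2$ be standard Gaussian random variables with
correlation $\mathbf{E}[z_1 z_2] = \rho \in [-1, 1]$. Then

\[ \Pr[z_1 \le 0, z_2 \le 0] = \tfrac{1}{2} - \tfrac{1}{2} \tfrac{\arccos \rho}{\pi}. \]

Proving Sheppard’s Formula is a nice exercise using the rotational symmetry of
a pair of independent standard Gaussians; we defer the proof till Example 11.19
in Chapter 11.1. This completes the justification of formula (5.5) for the limiting
noise stability of majority.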

You may have noticed that once we applied the 2-dimensional CLT to (5.6),
the remainder of the derivation had nothing to do with majority. In fact, the
same analysis works for any linear threshold function $\operatorname{sgn}(a_1 x_1 + \cdots + a_n x_n)$,
the only difference being the “error term” arising from the CLT. As in Theorem 5.16,
this error is small so long as no coefficient $a_i$ is too dominant:


Theorem 5.17. Let $f : \{-1,1\}^n \to \{-1,1\}$ be an unbiased LTF, $f(x) =
\operatorname{sgn}(a_1 x_1 + \cdots + a_n x_n)$ with $\sum_i a_i^2 = 1$ and $|a_i| \le \epsilon$ for all $i$. Then for
any $\rho \in (-1, 1)$,

\[ \bigl|\mathbf{Stab}_\rho[f] - \tfrac{2}{\pi} \arcsin \rho\bigr| \le O\Bigl(\tfrac{\epsilon}{\sqrt{1 - \rho^2}}\Bigr). \]


You are asked to prove Theorem 5.17 in Exercise 5.33. In the particular case
of $\mathrm{Maj}_n$ where $a_i = \frac{1}{\sqrt{n}}$ for all $i$ we can make a slightly stronger claim (see
Exercise 5.23):


Theorem 5.18. For any $\rho \in [0, 1)$, $\mathbf{Stab}_\rho[\mathrm{Maj}_n]$ is a decreasing function of $n$,
with

\[ \tfrac{2}{\pi} \arcsin \rho \le \mathbf{Stab}_\rho[\mathrm{Maj}_n] \le \tfrac{2}{\pi} \arcsin \rho + O\Bigl(\tfrac{1}{\sqrt{1 - \rho^2}\sqrt{n}}\Bigr). \]

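The monotone convergence of the noise stability of majority toward (2/π) arcsin ρ can be observed exactly for small odd n. The sketch below is illustrative, with helper names chosen here: it evaluates the definition directly, conditioning on the number j of +1 coordinates in x and using the fact that y flips each bit of x independently with probability (1 - ρ)/2.

```python
from math import asin, comb, pi

def binomial_pmf(m, p):
    """Probabilities of 0, 1, ..., m successes in m independent trials."""
    return [comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(m + 1)]

def stab_majority(n, rho):
    """Stab_rho[Maj_n] for odd n, computed exactly from the definition."""
    delta = (1 - rho) / 2  # each bit of x is flipped with probability delta
    total = 0.0
    for j in range(n + 1):  # j = number of +1 coordinates of x
        kept = binomial_pmf(j, 1 - delta)    # +1's of x that remain +1 in y
        gained = binomial_pmf(n - j, delta)  # -1's of x that become +1 in y
        pr_y_plus = sum(kept[a] * gained[b]
                        for a in range(j + 1) for b in range(n - j + 1)
                        if 2 * (a + b) > n)
        maj_x = 1 if 2 * j > n else -1
        total += comb(n, j) / 2 ** n * maj_x * (2 * pr_y_plus - 1)
    return total
```

At ρ = 1/2 the limit is (2/π) arcsin(1/2) = 1/3, and the exact values decrease toward it from above: 0.5 for n = 1 and 0.40625 for n = 3.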
We end this section by mentioning another way in which the majority 
function is extremal: among all unbiased functions with small influences, it has 
(essentially) the largest noise stability. 


Majority Is Stablest Theorem. Fix $\rho \in (0, 1)$. Then for any $f : \{-1,1\}^n \to
[-1, 1]$ with $\mathbf{E}[f] = 0$ and $\mathbf{MaxInf}[f] \le \tau$,

\[ \mathbf{Stab}_\rho[f] \le \tfrac{2}{\pi} \arcsin \rho + o_\tau(1) = 1 - \tfrac{2}{\pi} \arccos \rho + o_\tau(1). \]

For sufficiently small $\rho$, we’ll prove this in Section 5.4. The proof of the full
Majority Is Stablest Theorem will have to wait until Chapter 11.


5.3. The Fourier Coefficients of Majority 


In this section we will analyze the Fourier coefficients of $\mathrm{Maj}_n$. In fact, we
give an explicit formula for them in Theorem 5.19 below. But most of the
time this formula is not too useful; instead, it’s better to understand the Fourier
coefficients of $\mathrm{Maj}_n$ asymptotically as $n \to \infty$.

Let’s begin with a few basic observations. First, $\mathrm{Maj}_n$ is a symmetric function
and hence $\widehat{\mathrm{Maj}_n}(S)$ only depends on $|S|$ (Exercise 1.30). Second, $\mathrm{Maj}_n$ is
an odd function and hence $\widehat{\mathrm{Maj}_n}(S) = 0$ whenever $|S|$ is even (Exercise 1.8). It
remains to determine the Fourier coefficients $\widehat{\mathrm{Maj}_n}(S)$ for $|S|$ odd. By symmetry,
$\widehat{\mathrm{Maj}_n}(S)^2 = \mathbf{W}^k[\mathrm{Maj}_n]/\binom{n}{k}$ for all $|S| = k$, so if we are content to know the
magnitudes of $\mathrm{Maj}_n$’s Fourier coefficients, it suffices to determine the quantities
$\mathbf{W}^k(\mathrm{Maj}_n)$.

In fact, for each $k \in \mathbb{N}$ the quantity $\mathbf{W}^k(\mathrm{Maj}_n)$ converges to a fixed constant
as $n \to \infty$. We can deduce this using our analysis of the noise stability of
majority. From the previous section we know that for all $|\rho| \le 1$,

\[ \lim_{n \to \infty} \mathbf{Stab}_\rho[\mathrm{Maj}_n] = \tfrac{2}{\pi} \arcsin \rho = \tfrac{2}{\pi}\bigl(\rho + \tfrac{1}{6}\rho^3 + \tfrac{3}{40}\rho^5 + \tfrac{5}{112}\rho^7 + \cdots\bigr), \tag{5.8} \]

where we have used the power series for arcsin,

\[ \arcsin z = \sum_{k \text{ odd}} \frac{2}{k 2^k} \binom{k-1}{\frac{k-1}{2}} \cdot z^k, \tag{5.9} \]

valid for $|z| \le 1$ (see Exercise 5.18). Comparing (5.8) with the formula

\[ \mathbf{Stab}_\rho[\mathrm{Maj}_n] = \sum_{k \ge 0} \mathbf{W}^k[\mathrm{Maj}_n] \cdot \rho^k \]

suggests the following: For each fixed $k \in \mathbb{N}$,

\[ \lim_{n \to \infty} \mathbf{W}^k[\mathrm{Maj}_n] = [\rho^k]\bigl(\tfrac{2}{\pi} \arcsin \rho\bigr) = \begin{cases} \frac{4}{\pi k 2^k} \binom{k-1}{\frac{k-1}{2}} & \text{if } k \text{ odd,} \\ 0 & \text{if } k \text{ even.} \end{cases} \tag{5.10} \]

(Here $[z^k]F(z)$ denotes the coefficient on $z^k$ in power series $F(z)$.) Indeed,
we prove this identity below in Theorem 5.22. The noise stability method that
suggests it can also be made formal (Exercise 5.25).

Identity (5.10) is one way to formulate precisely the statement that the
“Fourier spectrum of $\mathrm{Maj}_n$ converges”. Introducing notation such as “$\mathbf{W}^k(\mathrm{Maj})$”
for the quantity in (5.10), we have the further asymptotics

\[ \begin{aligned} \text{for } k \text{ odd,} \quad \mathbf{W}^{k}(\mathrm{Maj}) &\sim \bigl(\tfrac{2}{\pi}\bigr)^{3/2} k^{-3/2}, \\ \mathbf{W}^{>k}(\mathrm{Maj}) &\sim \bigl(\tfrac{2}{\pi}\bigr)^{3/2} k^{-1/2} \quad \text{as } k \to \infty. \end{aligned} \tag{5.11} \]

(See Exercise 5.27.) The estimates (5.11), together with the precise value
$\mathbf{W}^1(\mathrm{Maj}) = \frac{2}{\pi}$, are usually all you need to know about the Fourier coefficients
of majority.
Nevertheless, let’s now compute the Fourier coefficients of $\mathrm{Maj}_n$ exactly.
Nevertheless, let’s now compute the Fourier coefficients of Maj, exactly. 


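Theorem 5.19. If $|S|$ is even, then $\widehat{\mathrm{Maj}_n}(S) = 0$. If $|S| = k$ is odd,

\[ \widehat{\mathrm{Maj}_n}(S) = (-1)^{\frac{k-1}{2}} \frac{\binom{\frac{n-1}{2}}{\frac{k-1}{2}}}{\binom{n-1}{k-1}} \cdot \frac{2}{2^n} \binom{n-1}{\frac{n-1}{2}}. \]

Proof. The first statement holds because $\mathrm{Maj}_n$ is an odd function; henceforth
we assume $|S| = k$ is odd. The trick will be to compute the Fourier expansion of
majority’s derivative $\mathrm{D}_n \mathrm{Maj}_n = \mathrm{Half}_{n-1} : \{-1,1\}^{n-1} \to \{0, 1\}$, the 0-1 indicator
of the set of $(n-1)$-bit strings with exactly half of their coordinates
equal to $-1$. By the derivative formula and the fact that $\mathrm{Maj}_n$ is symmetric,
$\widehat{\mathrm{Maj}_n}(S) = \widehat{\mathrm{Half}_{n-1}}(T)$ for any $T \subseteq [n-1]$ with $|T| = k - 1$. So writing
$n - 1 = 2m$ and $k - 1 = 2j$, it suffices to show

\[ \widehat{\mathrm{Half}_{2m}}([2j]) = (-1)^j \frac{\binom{m}{j}}{\binom{2m}{2j}} \cdot \frac{1}{2^{2m}} \binom{2m}{m}. \tag{5.12} \]

By the probabilistic definition of $\mathrm{T}_\rho$, for any $\rho \in [-1, 1]$ we have

\[ \mathrm{T}_\rho \mathrm{Half}_{2m}(1, 1, \dots, 1) = \mathbf{E}_{x \sim N_\rho((1, 1, \dots, 1))}[\mathrm{Half}_{2m}(x)] = \Pr[x \text{ has } m \text{ 1’s and } m\ {-1}\text{’s}], \]

where each coordinate of $x$ is 1 with probability $\frac{1}{2} + \frac{1}{2}\rho$. Thus

\[ \mathrm{T}_\rho \mathrm{Half}_{2m}(1, 1, \dots, 1) = \binom{2m}{m} \bigl(\tfrac{1}{2} + \tfrac{1}{2}\rho\bigr)^m \bigl(\tfrac{1}{2} - \tfrac{1}{2}\rho\bigr)^m = \tfrac{1}{2^{2m}} \binom{2m}{m} (1 - \rho^2)^m. \tag{5.13} \]

On the other hand, by the Fourier formula for $\mathrm{T}_\rho$ and the fact that $\mathrm{Half}_{2m}$ is
symmetric we have

\[ \mathrm{T}_\rho \mathrm{Half}_{2m}(1, 1, \dots, 1) = \sum_{U \subseteq [2m]} \widehat{\mathrm{Half}_{2m}}(U) \rho^{|U|} = \sum_{i=0}^{2m} \binom{2m}{i} \widehat{\mathrm{Half}_{2m}}([i]) \rho^i. \tag{5.14} \]

Since we have equality (5.13) = (5.14) between two degree-$2m$ polynomials
of $\rho$ on all of $[-1, 1]$, we can equate coefficients. In particular, for $i = 2j$ we
have

\[ \binom{2m}{2j} \widehat{\mathrm{Half}_{2m}}([2j]) = \tfrac{1}{2^{2m}} \binom{2m}{m} \cdot [\rho^{2j}](1 - \rho^2)^m = \tfrac{1}{2^{2m}} \binom{2m}{m} \cdot (-1)^j \binom{m}{j}, \]

confirming (5.12).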
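The closed form of Theorem 5.19 can be checked against direct enumeration for small odd n. In this sketch the function names are illustrative, and the brute-force routine uses the set S = {1, ..., k}, which suffices because Maj_n is symmetric.

```python
from itertools import product
from math import comb, prod

def majority_coefficient(n, k):
    """Fourier coefficient of Maj_n on a set of size k, via Theorem 5.19 (n odd)."""
    if k % 2 == 0:
        return 0.0
    sign = (-1) ** ((k - 1) // 2)
    ratio = comb((n - 1) // 2, (k - 1) // 2) / comb(n - 1, k - 1)
    return sign * ratio * 2 / 2 ** n * comb(n - 1, (n - 1) // 2)

def majority_coefficient_brute_force(n, k):
    """E[Maj_n(x) * x_1 * ... * x_k], by enumerating the cube."""
    total = sum((1 if sum(x) > 0 else -1) * prod(x[:k])
                for x in product((-1, 1), repeat=n))
    return total / 2 ** n
```

For n = 5 the formula gives 3/8, -1/8, and 3/8 at degrees 1, 3, and 5, and the corresponding weights 45/64, 10/64, and 9/64 sum to 1 as Parseval requires.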


You are asked to prove the following corollaries in Exercises 5.20, 5.22: 


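Corollary 5.20. $|\widehat{\mathrm{Maj}_n}(S)| = |\widehat{\mathrm{Maj}_n}(T)|$ whenever $|S| + |T| = n + 1$. Hence also

\[ \mathbf{W}^{n-k+1}[\mathrm{Maj}_n] = \tfrac{k}{n-k+1} \mathbf{W}^{k}[\mathrm{Maj}_n]. \]

Corollary 5.21. For any odd $k$, $\mathbf{W}^k[\mathrm{Maj}_n]$ is a strictly decreasing function of $n$
(for $n \ge k$ odd).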


We can now prove the identity (5.10): 
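Theorem 5.22. For each fixed odd $k$,

\[ \mathbf{W}^k[\mathrm{Maj}_n] \searrow [\rho^k]\bigl(\tfrac{2}{\pi} \arcsin \rho\bigr) = \frac{4}{\pi k 2^k} \binom{k-1}{\frac{k-1}{2}} \]

as $n \ge k$ tends to $\infty$ (through the odd numbers). Further, we have the error
bound

\[ [\rho^k]\bigl(\tfrac{2}{\pi} \arcsin \rho\bigr) \le \mathbf{W}^k[\mathrm{Maj}_n] \le (1 + 2k/n) \cdot [\rho^k]\bigl(\tfrac{2}{\pi} \arcsin \rho\bigr) \tag{5.15} \]

for all $k < n/2$. (For $k > n/2$ you can use Corollary 5.20.)


Proof. Corollary 5.21 tells us that $\mathbf{W}^k[\mathrm{Maj}_n]$ is decreasing in $n$; hence we only
need to justify (5.15). Using the formula from Theorem 5.19 we have

\[ \frac{\mathbf{W}^k[\mathrm{Maj}_n]}{[\rho^k](\frac{2}{\pi} \arcsin \rho)} = \frac{\binom{n}{k} \cdot \Bigl[\binom{\frac{n-1}{2}}{\frac{k-1}{2}} \Big/ \binom{n-1}{k-1} \cdot \frac{2}{2^n} \binom{n-1}{\frac{n-1}{2}}\Bigr]^2}{\frac{4}{\pi k 2^k} \binom{k-1}{\frac{k-1}{2}}} = \tfrac{\pi}{2}\, n \cdot 2^{k-n} \binom{n-k}{\frac{n-k}{2}} \cdot 2^{1-n} \binom{n-1}{\frac{n-1}{2}}, \]

where the second identity is verified by expanding all binomial coefficients to
factorials. By Stirling’s approximation we have $2^{-m} \binom{m}{m/2} \nearrow \sqrt{\frac{2}{\pi m}}$, meaning
that the ratio of the left side to the right side increases to 1 as $m \to \infty$.
Thus

\[ \frac{\mathbf{W}^k[\mathrm{Maj}_n]}{[\rho^k](\frac{2}{\pi} \arcsin \rho)} \le \frac{n}{\sqrt{n-k}\sqrt{n-1}} = \Bigl(1 - \tfrac{k+1}{n} + \tfrac{k}{n^2}\Bigr)^{-1/2}, \]

and the right-hand side is at most $1 + 2k/n$ for $1 \le k \le n/2$ by Exercise 5.24.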
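The error bound (5.15) and the monotonicity from Corollary 5.21 can be spot-checked numerically. The following sketch, with illustrative function names, combines the exact coefficient formula of Theorem 5.19 with the limiting value from (5.10).

```python
from math import comb, pi

def majority_weight(n, k):
    """W^k[Maj_n] for odd n and odd k, from the closed-form coefficient of Theorem 5.19."""
    coeff = (comb((n - 1) // 2, (k - 1) // 2) / comb(n - 1, k - 1)
             * 2 / 2 ** n * comb(n - 1, (n - 1) // 2))
    return comb(n, k) * coeff ** 2

def limiting_weight(k):
    """[rho^k]((2/pi) arcsin rho) for odd k, as in (5.10)."""
    return 4 / (pi * k * 2 ** k) * comb(k - 1, (k - 1) // 2)

def satisfies_error_bound(n, k):
    """Check limiting_weight(k) <= W^k[Maj_n] <= (1 + 2k/n) * limiting_weight(k)."""
    lower = limiting_weight(k)
    return lower <= majority_weight(n, k) <= (1 + 2 * k / n) * lower
```

For example, with n = 3 and k = 1 the exact weight is 3/4, which lies between 2/π ≈ 0.637 and (1 + 2/3)(2/π) ≈ 1.061.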


Finally, we can deduce the asymptotics (5.11) from this theorem (see Exer- 
cise 5.27): 


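Corollary 5.23. Let $k \in \mathbb{N}$ be odd and assume $n = n(k) \ge 2k^2$. Then

\[ \mathbf{W}^{k}(\mathrm{Maj}_n) = \bigl(\tfrac{2}{\pi}\bigr)^{3/2} k^{-3/2} \cdot (1 \pm O(1/k)), \]

\[ \mathbf{W}^{>k}(\mathrm{Maj}_n) = \bigl(\tfrac{2}{\pi}\bigr)^{3/2} k^{-1/2} \cdot (1 \pm O(1/k)), \]

and hence the Fourier spectrum of $\mathrm{Maj}_n$ is $\epsilon$-concentrated on degree up to
$\frac{8}{\pi^3} \epsilon^{-2} + O_\epsilon(1)$.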


5.4. Degree-1 Weight 


In this section we prove two theorems about the degree-1 Fourier weight of
Boolean functions:

\[ \mathbf{W}^1[f] = \sum_{i=1}^{n} \widehat{f}(i)^2. \]

This important quantity can be given a combinatorial interpretation thanks to
the noise stability formula $\mathbf{Stab}_\rho[f] = \sum_{k \ge 0} \rho^k \mathbf{W}^k[f]$:

\[ \text{For } f : \{-1,1\}^n \to \mathbb{R}, \qquad \mathbf{W}^1[f] = \frac{d}{d\rho} \mathbf{Stab}_\rho[f] \Bigr|_{\rho = 0}. \]

Thinking of $\|f\|_2$ as constant and $\rho \to 0$, the noise stability formula implies

\[ \mathbf{Stab}_\rho[f] = \mathbf{E}[f]^2 + \mathbf{W}^1[f]\rho \pm O(\rho^2), \]

or equivalently,

\[ \mathbf{Cov}_{(x,y)\ \rho\text{-correlated}}[f(x), f(y)] = \mathbf{W}^1[f]\rho \pm O(\rho^2). \]

In other words, for $f : \{-1,1\}^n \to \{-1,1\}$ the degree-1 weight quantifies
the extent to which $\Pr[f(x) = f(y)]$ increases when $x$ and $y$ go from being
uncorrelated to being slightly correlated.

There is an additional viewpoint if we think of $f$ as the indicator of a subset
$A \subseteq \{-1,1\}^n$ and its noise sensitivity $\mathbf{NS}_\delta[f]$ as a notion of $A$’s “surface area”,
or “noisy boundary size”. For nearly maximal noise rates (i.e., $\delta = \frac{1}{2} - \frac{1}{2}\rho$
where $\rho$ is small) we have that $A$’s noisy boundary size is “small” if and only
if $\mathbf{W}^1[f]$ is “large” (vis-à-vis $A$’s measure).

Two examples suggest themselves when thinking of subsets of the Hamming 
cube with small “boundary”: subcubes and Hamming balls. 


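Proposition 5.24. Let $f : \mathbb{F}_2^n \to \{0, 1\}$ be the indicator of a subcube of
codimension $k \ge 1$ (e.g., the $\mathrm{AND}_k$ function). Then $\mathbf{E}[f] = 2^{-k}$, $\mathbf{W}^1[f] = k 2^{-2k}$.


Proposition 5.25. Fix $t \in \mathbb{R}$. Consider the sequence of LTFs $f_n : \{-1,1\}^n \to
\{0, 1\}$ defined by $f_n(x) = 1$ if and only if $\sum_{i=1}^n \tfrac{1}{\sqrt{n}} x_i > t$. (That is, $f_n$ is the
indicator of the Hamming ball $\{x : \Delta(x, (1, \dots, 1)) < \frac{n}{2} - \frac{t}{2}\sqrt{n}\}$.) Then

\[ \lim_{n \to \infty} \mathbf{E}[f_n] = \overline{\Phi}(t), \qquad \lim_{n \to \infty} \mathbf{W}^1[f_n] = \varphi(t)^2. \]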
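Both propositions can be probed numerically. The sketch below is only an illustration: the function names, the ±1 encoding of the subcube (with -1 playing the role of True), and the test sizes are assumptions made here. The Hamming ball computation uses symmetry: every degree-1 coefficient of f_n equals E[f_n(x)(x_1 + ... + x_n)]/n.

```python
from itertools import product
from math import comb, sqrt

def subcube_stats(n, k):
    """(E[f], W^1[f]) for the 0-1 indicator of {x : x_1 = ... = x_k = -1} in {-1,1}^n."""
    cube = list(product((-1, 1), repeat=n))
    f = [1 if all(xi == -1 for xi in x[:k]) else 0 for x in cube]
    mean = sum(f) / len(cube)
    w1 = sum((sum(v * x[i] for v, x in zip(f, cube)) / len(cube)) ** 2 for i in range(n))
    return mean, w1

def hamming_ball_stats(n, t):
    """(E[f_n], W^1[f_n]) for f_n(x) = 1 iff (x_1 + ... + x_n)/sqrt(n) > t."""
    mass = first_moment = 0
    for k in range(n + 1):  # k = number of +1 coordinates, so the bit sum is 2k - n
        total = 2 * k - n
        if total > t * sqrt(n):
            count = comb(n, k)
            mass += count
            first_moment += count * total
    mean = mass / 2 ** n
    coeff = first_moment / 2 ** n / n  # common value of the n degree-1 coefficients
    return mean, n * coeff ** 2
```

With t = 1 and n = 2501, the second function returns values close to the Gaussian tail 0.1587 and the squared density 0.0585 at t = 1, in line with the stated limits.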


You are asked to verify these facts in Exercises 5.29, 5.30. Regarding
Proposition 5.25, it’s natural for $\varphi(t)$ to arise since $\mathbf{W}^1[f_n]$ is related to the influences
of $f_n$, and coordinates are influential for $f_n$ if and only if $\sum_i \tfrac{1}{\sqrt{n}} x_i \approx t$. If we
write $\alpha = \lim_n \mathbf{E}[f_n]$ then this proposition can be thought of as saying that
$\mathbf{W}^1[f_n] \to \mathcal{U}(\alpha)^2$, where $\mathcal{U}$ is defined as follows:


Definition 5.26. The Gaussian isoperimetric function $\mathcal{U} : [0, 1] \to [0, \frac{1}{\sqrt{2\pi}}]$
is defined by $\mathcal{U} = \varphi \circ \Phi^{-1}$. This function is symmetric about 1/2; i.e., $\mathcal{U} =
\varphi \circ \overline{\Phi}^{-1}$.
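The name of this function will be explained when we study the Gaussian
Isoperimetric Inequality in Chapter 11.4. For now we’ll just use the following
fact:


Proposition 5.27. For $\alpha \to 0^+$, $\mathcal{U}(\alpha) \sim \alpha\sqrt{2\ln(1/\alpha)}$.


Proof. Write $\alpha = \overline{\Phi}(t)$, where $t \to \infty$. We use the well-known fact that $\overline{\Phi}(t) \sim
\varphi(t)/t$. Thus

\[ \alpha \sim \tfrac{1}{\sqrt{2\pi}\, t} \exp(-t^2/2) \quad\Longrightarrow\quad t \sim \sqrt{2\ln(1/\alpha)}, \]
\[ \varphi(t) \sim \alpha \cdot t \quad\Longrightarrow\quad \mathcal{U}(\alpha) \sim \alpha \cdot t \sim \alpha\sqrt{2\ln(1/\alpha)}. \]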
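The convergence in Proposition 5.27 is slow, which a short numerical experiment makes visible. This sketch uses the standard library's `statistics.NormalDist` for the Gaussian pdf and inverse cdf; the function names are illustrative.

```python
from math import log, sqrt
from statistics import NormalDist

STANDARD = NormalDist()

def gaussian_isoperimetric(alpha):
    """U(alpha) = phi(Phi^{-1}(alpha)) for 0 < alpha < 1."""
    return STANDARD.pdf(STANDARD.inv_cdf(alpha))

def asymptotic_ratio(alpha):
    """U(alpha) divided by the asymptotic form alpha * sqrt(2 ln(1/alpha))."""
    return gaussian_isoperimetric(alpha) / (alpha * sqrt(2 * log(1 / alpha)))
```

The ratio is roughly 0.91 at α = 10⁻³ and roughly 0.94 at α = 10⁻⁶, creeping toward 1 only as α becomes extremely small.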


Given Propositions 5.24 and 5.25, let’s consider the degree-1 Fourier weight
of subcubes and Hamming balls asymptotically as their “volume” $\alpha = \mathbf{E}[f]$
tends to 0. For the subcubes we have $\mathbf{W}^1[f] = \alpha^2 \log(1/\alpha)$. For the Hamming
balls we have $\mathbf{W}^1[f_n] \to \mathcal{U}(\alpha)^2 \sim 2\alpha^2 \ln(1/\alpha)$. So in both cases we have an
upper bound of $O(\alpha^2 \log(1/\alpha))$.

You should think of this upper bound $O(\alpha^2 \log(1/\alpha))$ as being unusually
small. The obvious a priori upper bound, given that $f : \{-1,1\}^n \to \{0, 1\}$ has
$\mathbf{E}[f] = \alpha$, is

\[ \mathbf{W}^1[f] \le \mathbf{Var}[f] = \alpha(1 - \alpha) \sim \alpha. \]


Yet subcubes and Hamming balls have degree-1 weight which is almost quadrat- 
ically smaller. In fact the first theorem we will show in this section is the 
following: 


Level-1 Inequality. Let $f : \{-1,1\}^n \to \{0, 1\}$ have mean $\mathbf{E}[f] = \alpha \le 1/2$.
Then

\[ \mathbf{W}^1[f] \le O(\alpha^2 \log(1/\alpha)). \]

(For the case $\alpha \ge 1/2$, replace $f$ by $1 - f$.)

Thus all small subsets of $\{-1,1\}^n$ have unusually small $\mathbf{W}^1[f]$; or equivalently
(in some sense), unusually large “noisy boundary”. This is another key
illustration of the idea that the Hamming cube is a “small-set expander”.


Remark 5.28. The bound in the Level-1 Inequality has a sharp form, $\mathbf{W}^1[f] \le
2\alpha^2 \ln(1/\alpha)$. Thus Hamming balls are in fact the “asymptotic maximizers” of
$\mathbf{W}^1[f]$ among sets of small volume $\alpha$. Also, the inequality holds more generally
for $f : \{-1,1\}^n \to [-1, 1]$ with $\alpha = \mathbf{E}[|f|]$.
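In very low dimension the sharp form can be checked exhaustively. The sketch below is an illustration with names chosen here: it runs over all 256 functions from {-1,1}^3 to {0,1} and records the largest ratio of W^1[f] to 2α² ln(1/α) among those with 0 < α ≤ 1/2.

```python
from itertools import product
from math import log

N = 3
CUBE = list(product((-1, 1), repeat=N))

def mean_and_degree_one_weight(values):
    """(alpha, W^1[f]) for the 0-1 valued f whose truth table is `values`."""
    alpha = sum(values) / len(CUBE)
    w1 = sum((sum(v * x[i] for v, x in zip(values, CUBE)) / len(CUBE)) ** 2
             for i in range(N))
    return alpha, w1

def worst_ratio():
    """max of W^1[f] / (2 a^2 ln(1/a)) over all f : {-1,1}^3 -> {0,1} with 0 < a <= 1/2."""
    worst = 0.0
    for values in product((0, 1), repeat=len(CUBE)):
        alpha, w1 = mean_and_degree_one_weight(values)
        if 0 < alpha <= 0.5:
            worst = max(worst, w1 / (2 * alpha ** 2 * log(1 / alpha)))
    return worst
```

On 3 bits the maximum ratio is 1/(2 ln 2) ≈ 0.72, attained by subcubes; this is consistent with the remark that Hamming balls approach the bound only asymptotically, in high dimension and at small volume.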


Remark 5.29. The name “Level-1 Inequality” is not completely standard; e.g.,
in additive combinatorics the result would be called Chang’s Inequality. We
use this name because we will also generalize to “Level-k Inequalities” in
Chapter 9.5.


So far we considered maximizing degree-1 weight among subsets of the
Hamming cube of a fixed small volume, $\alpha$. The second theorem in this section
is concerned with what happens when there is no volume constraint. In this case,
maximizing examples tend to have volume $\alpha = 1/2$; switching the notation to
$f : \{-1,1\}^n \to \{-1,1\}$, this corresponds to $f$ being unbiased ($\mathbf{E}[f] = 0$). The
unbiased Hamming ball is $\mathrm{Maj}_n$, which we know has $\mathbf{W}^1[\mathrm{Maj}_n] \to \frac{2}{\pi}$. This is
quite large. But unbiased subcubes are just the dictators $\chi_i$ and their negations;
these have $\mathbf{W}^1[\pm\chi_i] = 1$ which is obviously maximal.

Thus the question of which $f : \{-1,1\}^n \to \{-1,1\}$ maximizes $\mathbf{W}^1[f]$ has
a trivial answer. But this answer is arguably unsatisfactory, since dictators (and
their negations) are not “really” functions of $n$ bits. Indeed, when we studied
social choice in Chapter 2 we were motivated to rule out functions $f$ having a
coordinate with unfairly large influence. And in fact Proposition 2.58 showed
that if all $\widehat{f}(i)$ are equal (and hence small) then $\mathbf{W}^1[f] \le \frac{2}{\pi} + o_n(1)$. The
second theorem of this section significantly generalizes Proposition 2.58:


The 2 Theorem. Let f : {—1, 1}” > {-1, 1} satisfy |f@| <€foralli € [n]. 
Then 


W'If] < 2+ 0(6). (5.16) 
Further, if W'[f] > 2 — e, then f is O(./€)-close to the LTF sgn( f=). 


Functions $f$ with $|\hat{f}(i)| \le \epsilon$ for all $i \in [n]$ are called $(\epsilon, 1)$-regular; see
Chapter 6.1. So the $\frac{2}{\pi}$ Theorem says (roughly speaking) that within the class
of $(\epsilon, 1)$-regular functions, the maximal degree-1 weight is $\frac{2}{\pi}$, and any function
achieving this is an unbiased LTF. Further, from Theorem 5.17 we know that
all unbiased LTFs which are $(\epsilon, 1)$-regular achieve this.


Remark 5.30. Since we have $\mathbf{Stab}_\rho[f] \approx \mathbf{W}^{1}[f]\rho$ and $\frac{2}{\pi}\arcsin\rho \approx \frac{2}{\pi}\rho$ when
$\rho$ is small, the $\frac{2}{\pi}$ Theorem gives the Majority Is Stablest Theorem in the limit
$\rho \to 0^+$.


Let’s now discuss how we'll prove our two theorems about degree-1
weight. Let $f : \{-1,1\}^n \to \{0,1\}$ and $\alpha = \mathbf{E}[f]$; we think of $\alpha$ as small
for the Level-1 Inequality and $\alpha = 1/2$ for the $\frac{2}{\pi}$ Theorem. By Plancherel,
$\mathbf{W}^{1}[f] = \mathbf{E}[f(x)L(x)]$, where

    $L(x) = f^{=1}(x) = \hat{f}(1)x_1 + \cdots + \hat{f}(n)x_n$.

To upper-bound $\mathbf{E}[f(x)L(x)]$, consider that as $x$ varies the real number $L(x)$
may be rather large or small, but $f(x)$ is always 0 or 1. Given that $f(x)$ is 1
on only an $\alpha$ fraction of $x$’s, the “worst case” for $\mathbf{E}[f(x)L(x)]$ would be if $f(x)$
were 1 precisely on the $\alpha$ fraction of $x$’s where $L(x)$ is largest. In other words,

    $\mathbf{W}^{1}[f] = \mathbf{E}[f(x)L(x)] \le \mathbf{E}[1_{\{L(x) \ge t\}} \cdot L(x)]$,    (5.17)

where $t$ is chosen so that

    $\Pr[L(x) \ge t] \approx \alpha$.    (5.18)


But now we can analyze (5.17) quite effectively using tools such as Hoeffding’s
bound and the CLT, since $L(x)$ is just a linear combination of independent $\pm 1$
random bits. In particular $L(x)$ has mean 0 and standard deviation $\sigma = \sqrt{\mathbf{W}^{1}[f]}$,
so by the CLT it acts like the Gaussian $Z \sim \mathrm{N}(0, \sigma^2)$, at least if we assume all
$|\hat{f}(i)|$ are small. If we are thinking of $\alpha = 1/2$, then $t = 0$ and we get

    $\sigma^2 = \mathbf{W}^{1}[f] \le \mathbf{E}[1_{\{L(x) \ge 0\}} \cdot L(x)] \approx \mathbf{E}[1_{\{Z \ge 0\}} \cdot Z] = \frac{1}{\sqrt{2\pi}}\sigma$.

This implies $\sigma^2 \lessapprox \frac{1}{2\pi}$, as claimed in the $\frac{2}{\pi}$ Theorem (after adjusting $f$’s range
to $\{-1,1\}$: replacing $f$ by $2f - 1$ multiplies the degree-1 weight by 4). If we are
instead thinking of $\alpha$ as small then (5.18) suggests taking
$t \sim \sigma\sqrt{2\ln(1/\alpha)}$ so that $\Pr[Z \ge t] \approx \alpha$. Then a calculation akin to the one in
Proposition 5.27 implies

    $\mathbf{W}^{1}[f] \lessapprox \mathbf{E}[1_{\{Z \ge t\}} \cdot Z] \approx \alpha \cdot \sigma\sqrt{2\ln(1/\alpha)}$,

from which the Level-1 Inequality follows. In fact, we don’t even need all $|\hat{f}(i)|$
small for this latter analysis; for large $t$ it’s possible to upper-bound (5.17) using
only Hoeffding’s bound:


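Lemma 5.31. Let $\ell(x) = a_1x_1 + \cdots + a_nx_n$, where $\sum_i a_i^2 = 1$. Then for any
$s \ge 1$,

    $\mathbf{E}[1_{\{|\ell(x)| > s\}} \cdot |\ell(x)|] \le (2s + 2)\exp(-\frac{s^2}{2})$.

Proof. We have

    $\mathbf{E}[1_{\{|\ell(x)| > s\}} \cdot |\ell(x)|] = s\Pr[|\ell(x)| > s] + \int_s^\infty \Pr[|\ell(x)| > u]\,du$
    $\le 2s\exp(-\frac{s^2}{2}) + \int_s^\infty 2\exp(-\frac{u^2}{2})\,du$,

using Hoeffding’s bound. But for $s \ge 1$,

    $\int_s^\infty 2\exp(-\frac{u^2}{2})\,du \le \int_s^\infty u \cdot 2\exp(-\frac{u^2}{2})\,du = 2\exp(-\frac{s^2}{2})$.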
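The bound in Lemma 5.31 can be sanity-checked numerically on small examples. The sketch below is an illustration only, not taken from the original text; the function names and the two weight vectors are hypothetical choices, and the expectation is computed exactly by enumerating all of $\{-1,1\}^n$.

```python
from itertools import product
from math import exp, sqrt

def tail_expectation(a, s):
    """E[ 1{|l(x)| > s} * |l(x)| ] for l(x) = sum_i a_i x_i, x uniform on {-1,1}^n."""
    n = len(a)
    total = 0.0
    for x in product((-1, 1), repeat=n):
        val = abs(sum(ai * xi for ai, xi in zip(a, x)))
        if val > s:
            total += val
    return total / 2 ** n

def lemma_bound(s):
    """Right-hand side of Lemma 5.31: (2s + 2) * exp(-s^2 / 2)."""
    return (2 * s + 2) * exp(-s * s / 2)

def normalize(raw):
    """Scale a weight vector so that sum_i a_i^2 = 1, as the lemma requires."""
    norm = sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw]

# Two hypothetical weight vectors on n = 10 coordinates.
equal_weights = normalize([1] * 10)
skewed_weights = normalize([4, 3, 2, 2, 1, 1, 1, 1, 1, 1])

if __name__ == "__main__":
    for s in (1.0, 1.5, 2.0, 2.5):
        print(s, tail_expectation(equal_weights, s),
              tail_expectation(skewed_weights, s), lemma_bound(s))
```

The lemma only requires $s \ge 1$ and unit $\ell_2$ norm of the coefficients, so both weight vectors should satisfy the bound for every listed $s$.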




We now give formal proofs of the two theorems, commenting that rather
than $L(x)$ it’s more convenient to work with

    $\ell(x) = \frac{1}{\sigma}f^{=1}(x) = \frac{\hat{f}(1)}{\sigma}x_1 + \cdots + \frac{\hat{f}(n)}{\sigma}x_n$.


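Proof of the Level-1 Inequality. Following Remark 5.28 we let
$f : \{-1,1\}^n \to [-1,1]$ and $\alpha = \mathbf{E}[|f|]$. We may assume $\sigma = \sqrt{\mathbf{W}^{1}[f]} > 0$.
Writing $\ell = \frac{1}{\sigma}f^{=1}$ we have $\langle f, \ell \rangle = \frac{1}{\sigma}\langle f, f^{=1} \rangle = \frac{1}{\sigma}\mathbf{W}^{1}[f] = \sigma$ and hence

    $\sigma = \langle f, \ell \rangle = \mathbf{E}[1_{\{|\ell(x)| \le s\}} \cdot f(x)\ell(x)] + \mathbf{E}[1_{\{|\ell(x)| > s\}} \cdot f(x)\ell(x)]$

holds for any $s \ge 1$. The first expectation above is at most $\mathbf{E}[s|f(x)|] = \alpha s$, and
the second is at most $(2 + 2s)\exp(-s^2/2) \le 4s\exp(-s^2/2)$ by Lemma 5.31.
Hence

    $\sigma \le \alpha s + 4s\exp(-s^2/2)$.

The optimal choice of $s$ is $s = (\sqrt{2} + o_\alpha(1))\sqrt{\ln(1/\alpha)}$, yielding

    $\sigma \le (\sqrt{2} + o(1))\,\alpha\sqrt{\ln(1/\alpha)}$.

Squaring this establishes the claim $\sigma^2 \le (2 + o_\alpha(1))\,\alpha^2 \ln(1/\alpha)$.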
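To see where the shape of the final bound comes from, it may help to plug in a simple, slightly suboptimal choice of $s$. The following worked computation is an illustrative addition, not part of the original proof; it assumes $\alpha \le 1/2$ so that the first choice of $s$ satisfies $s \ge 1$, and it assumes $\alpha < 1/e$ for the second choice.

```latex
% Crude choice: s = \sqrt{2\ln(1/\alpha)}, so that \exp(-s^2/2) = \alpha.
\sigma \;\le\; \alpha s + 4s\exp(-s^2/2)
       \;=\; \alpha s + 4 s \alpha
       \;=\; 5\alpha\sqrt{2\ln(1/\alpha)},
\qquad\text{hence}\qquad
\sigma^2 \;\le\; 50\,\alpha^2\ln(1/\alpha).

% Refined choice: s^2 = 2\ln(1/\alpha) + 2\ln\ln(1/\alpha), so that
% \exp(-s^2/2) = \alpha/\ln(1/\alpha) and the second term is lower order:
\sigma \;\le\; \alpha s\Bigl(1 + \frac{4}{\ln(1/\alpha)}\Bigr),
\qquad
s = (\sqrt{2} + o_\alpha(1))\sqrt{\ln(1/\alpha)}.
```

The crude choice already gives the correct order $\alpha^2\ln(1/\alpha)$ with constant 50; making $s$ slightly larger pushes the Hoeffding term to lower order and recovers the constant $2 + o_\alpha(1)$.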


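Proof of the $\frac{2}{\pi}$ Theorem. We may assume $\sigma = \sqrt{\mathbf{W}^{1}[f]} \ge 1/2$: for the
theorem’s first statement this is because otherwise there is nothing to prove; for
the theorem’s second statement this is because we may assume $\epsilon$ sufficiently
small.

We start by proving (5.16). Let $\ell = \frac{1}{\sigma}f^{=1}$, so $\|\ell\|_2 = 1$ and $|\hat{\ell}(i)| \le 2\epsilon$ for
all $i \in [n]$. We have

    $\sigma = \langle f, \ell \rangle \le \mathbf{E}[|\ell|] \le \sqrt{\frac{2}{\pi}} + C\epsilon$    (5.19)

for some constant $C$, where we used Theorem 5.16. Squaring this proves (5.16).
We observe that (5.16) therefore holds even for $f : \{-1,1\}^n \to [-1,1]$.
Now suppose we also have $\mathbf{W}^{1}[f] \ge \frac{2}{\pi} - \epsilon$; i.e.,

    $\sigma \ge \sqrt{\frac{2}{\pi} - \epsilon} \ge \sqrt{\frac{2}{\pi}} - 2\epsilon$.

Thus the first inequality in (5.19) must be close to tight; specifically,

    $(C + 2)\epsilon \ge \mathbf{E}[|\ell|] - \langle f, \ell \rangle = \mathbf{E}[(\mathrm{sgn}(\ell(x)) - f(x)) \cdot \ell(x)]$.    (5.20)

By the Berry–Esseen Theorem (and Remark 5.15, Exercise 5.16),

    $\Pr[|\ell| \le K\sqrt{\epsilon}] \le \Pr[|\mathrm{N}(0,1)| \le K\sqrt{\epsilon}] + .56 \cdot 2\epsilon$
    $\le \frac{1}{\sqrt{2\pi}} \cdot 2K\sqrt{\epsilon} + 1.12\epsilon \le 2K\sqrt{\epsilon}$

for any constant $K \ge 1$. We therefore have the implication

    $\Pr[f \ne \mathrm{sgn}(\ell)] \ge 3K\sqrt{\epsilon}$
    $\implies \Pr[f(x) \ne \mathrm{sgn}(\ell(x)) \wedge |\ell(x)| > K\sqrt{\epsilon}] \ge K\sqrt{\epsilon}$
    $\implies \mathbf{E}[(\mathrm{sgn}(\ell(x)) - f(x)) \cdot \ell(x)] \ge K\sqrt{\epsilon} \cdot 2(K\sqrt{\epsilon}) = 2K^2\epsilon$.

This contradicts (5.20) for $K = \sqrt{C + 2}$, say. Thus $\Pr[f \ne \mathrm{sgn}(\ell)] \le 3\sqrt{C + 2}\sqrt{\epsilon}$,
completing the proof.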


For an interpolation between these two theorems, see Exercise 5.44. 
We conclude this section with an application of the Level-1 Inequality. First, 
a quick corollary which we leave for Exercise 5.37: 


Corollary 5.32. Let $f : \{-1,1\}^n \to \{-1,1\}$ have $|\mathbf{E}[f]| \ge 1 - \delta \ge 0$. Then
$\mathbf{W}^{1}[f] \le 4\delta^2 \log(2/\delta)$.


In Chapter 2.5 we stated the FKN Theorem, which says that if
$f : \{-1,1\}^n \to \{-1,1\}$ has $\mathbf{W}^{1}[f] \ge 1 - \delta$ then it must be $O(\delta)$-close to
a dictator or negated-dictator. The following theorem shows that once the
FKN Theorem is proved, it can be strengthened to give an essentially optimal
(Exercise 5.36) closeness bound:


Theorem 5.33. Suppose the FKN Theorem holds with closeness bound $C\delta$,
where $C \ge 1$ is a universal constant. Then in fact it holds with bound $\delta/4 + \eta$,
where $\eta = 16C^2\delta^2 \max(\log(1/C\delta), 1)$.


Proof. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ has $\mathbf{W}^{1}[f] \ge 1 - \delta \ge 0$. By assumption
$f$ is $C\delta$-close to $\pm\chi_i$ for some $i \in [n]$, say $i = n$. Thus we have

    $|\hat{f}(n)| \ge 1 - 2C\delta$

and our task is to show that in fact $|\hat{f}(n)| \ge 1 - \delta/2 - 2\eta$. We may assume
$\delta \le \frac{1}{10C}$ as otherwise $1 - \delta/2 - 2\eta < 0$ (Exercise 5.38) and there is nothing
to prove. By employing the trick from Exercise 2.49 we may also assume
$\mathbf{E}[f] = 0$.

Consider the restriction of $f$ given by fixing coordinate $n$ to $b \in \{-1,1\}$;
i.e., $f_{[n-1]|b}$. For both choices of $b$ we have $|\mathbf{E}[f_{[n-1]|b}]| \ge 1 - 2C\delta$ and so
Corollary 5.32 implies $\mathbf{W}^{1}[f_{[n-1]|b}] \le 16C^2\delta^2 \log(1/C\delta)$. Thus

    $16C^2\delta^2 \log(1/C\delta) \ge \mathbf{E}_b[\mathbf{W}^{1}[f_{[n-1]|b}]] = \sum_{j<n} (\hat{f}(\{j\})^2 + \hat{f}(\{j,n\})^2) \ge \sum_{j<n} \hat{f}(j)^2$

by Corollary 3.22. It follows that

    $\hat{f}(n)^2 = \mathbf{W}^{1}[f] - \sum_{j<n} \hat{f}(j)^2 \ge 1 - \delta - 16C^2\delta^2 \log(1/C\delta)$,

and the proof is completed by the fact that

    $1 - \delta - 16C^2\delta^2 \log(1/C\delta) \ge (1 - \delta/2 - 2\eta)^2$

when $\delta \le \frac{1}{10C}$ (Exercise 5.38).


5.5. Highlight: Peres’s Theorem and Uniform Noise Stability 


Theorem 5.17 says that if $f$ is an unbiased linear threshold function
$f(x) = \mathrm{sgn}(a_1x_1 + \cdots + a_nx_n)$ in which all $a_i$’s are “small”, then the noise stability
$\mathbf{Stab}_\rho[f]$ is at least (roughly) $\frac{2}{\pi}\arcsin\rho$. Rephrasing in terms of noise sensitivity,
this means $\mathbf{NS}_\delta[f]$ is at most (roughly) $\frac{2}{\pi}\sqrt{\delta} + O(\delta^{3/2})$ (see the statement
of Theorem 2.45). On the other hand, if some $a_i$ were particularly large
then $f$ would be pushed in the direction of the dictator function $\chi_i$, which has
$\mathbf{NS}_\delta[\chi_i] = \delta \ll \sqrt{\delta}$. This observation suggests that all unbiased LTFs $f$ should
have $\mathbf{NS}_\delta[f] \le O(\sqrt{\delta})$. The unbiasedness assumption also seems inessential,
since biasing a function should tend to decrease its noise sensitivity.
Indeed, the idea here is correct, as was shown by Peres in 1999:


Peres’s Theorem. Let $f : \{-1,1\}^n \to \{-1,1\}$ be any linear threshold function.
Then $\mathbf{NS}_\delta[f] \le O(\sqrt{\delta})$.


Pleasantly, the proof is quite simple and uses no heavy tools like the Central 
Limit Theorem. Before getting to it, let’s make some remarks. First, Peres’s 
Theorem shows that the class of all linear threshold functions is what’s called 
uniformly noise-stable. 


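Definition 5.34. Let $\mathcal{B}$ be a class of Boolean-valued functions. We say that $\mathcal{B}$
is uniformly noise-stable if there exists $\epsilon : [0, 1/2] \to [0, 1]$ with $\epsilon(\delta) \to 0$ as
$\delta \to 0^+$ such that $\mathbf{NS}_\delta[f] \le \epsilon(\delta)$ holds for all $f \in \mathcal{B}$.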


This definition is only interesting for infinite classes $\mathcal{B}$. (Any class containing
functions of only finitely many input lengths is vacuously uniformly
noise-stable; see Exercise 5.34.) By Proposition 5.6 we see that functions in a
uniformly noise-stable class have “almost all of their Fourier weight at constant
degree”; i.e., for all $\epsilon > 0$ there is a $k \in \mathbb{N}$ such that $\mathbf{W}^{>k}[f] \le \epsilon$ for all $f \in \mathcal{B}$.
In particular, from Corollary 3.34 we get that if $\mathcal{B}$ is a uniformly noise-stable
class then its restriction to $n$-input functions is learnable from random examples
to any constant error in $\mathrm{poly}(n)$ time.

Let’s make these observations more concrete in the context of linear threshold
functions. Peres’s Theorem immediately gives that LTFs have their Fourier
spectrum $\epsilon$-concentrated up to degree $O(1/\epsilon^2)$ (Proposition 3.3) and hence the
class of LTFs is learnable from random examples with error $\epsilon$ in time $n^{O(1/\epsilon^2)}$
(Corollary 3.34). The latter result is not too impressive since it’s been long
known that LTFs are learnable in time $\mathrm{poly}(n, 1/\epsilon)$ using linear programming.
However, the noise sensitivity approach is much more flexible. Consider the
concept class

    $\mathcal{C} = \{h = g(f_1, \dots, f_s) \mid f_1, \dots, f_s : \{-1,1\}^n \to \{-1,1\} \text{ are LTFs}\}$.

For each $h : \{-1,1\}^n \to \{-1,1\}$ in $\mathcal{C}$, Peres’s Theorem and a union bound
(Exercise 2.44) imply that $\mathbf{NS}_\delta[h] \le O(s\sqrt{\delta})$. Thus from Corollary 3.34 we
get that the class $\mathcal{C}$ is learnable in time $n^{O(s^2/\epsilon^2)}$. This is the only known way
of showing even that an AND of two LTFs is learnable with error .01 in time
$\mathrm{poly}(n)$.
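The union-bound step can be illustrated on a tiny instance with $s = 2$. The sketch below is an assumption-laden example rather than anything from the original text: the two LTFs, the helper names, and the convention that $-1$ plays the role of True in the AND are all hypothetical choices. It computes noise sensitivities exactly by enumeration.

```python
from itertools import product

def make_ltf(weights, theta):
    """Return x -> sgn(w.x - theta); integer weights with theta = 0.5 avoid ties."""
    return lambda x: 1 if sum(w * xi for w, xi in zip(weights, x)) - theta > 0 else -1

def noise_sensitivity(f, n, delta):
    """Exact NS_delta[f]: x uniform, y flips each bit of x independently w.p. delta."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        fx = f(x)
        for flips in product((1, -1), repeat=n):
            k = flips.count(-1)
            y = tuple(a * b for a, b in zip(x, flips))
            if f(y) != fx:
                total += delta ** k * (1 - delta) ** (n - k)
    return total / 2 ** n

n = 6
f1 = make_ltf((3, 2, 2, 1, 1, 1), 0.5)     # hypothetical LTF
f2 = make_ltf((1, -2, 1, 3, -1, 2), 0.5)   # hypothetical LTF

def h(x):
    # AND of the two LTFs, treating -1 as True.
    return -1 if f1(x) == -1 and f2(x) == -1 else 1

if __name__ == "__main__":
    for delta in (0.01, 0.05, 0.1, 0.25):
        lhs = noise_sensitivity(h, n, delta)
        rhs = noise_sensitivity(f1, n, delta) + noise_sensitivity(f2, n, delta)
        print(delta, lhs, rhs)
```

If $h(x) \ne h(y)$ then at least one of the $f_i$ must change value between $x$ and $y$, which is why the left column should never exceed the right one.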

The trick for proving Peres’s Theorem is to employ a fairly general technique 
for bounding noise sensitivity using average sensitivity (total influence): 


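Theorem 5.35. Let $\delta \in (0, 1/2]$ and let $A : \mathbb{N}^+ \to \mathbb{R}$. Let $\mathcal{B}$ be a class of
Boolean-valued functions closed under negation and identification of input
variables. Suppose that each $f \in \mathcal{B}$ with domain $\{-1,1\}^n$ has $\mathbf{I}[f] \le A(n)$.
Then each $f \in \mathcal{B}$ has $\mathbf{NS}_\delta[f] \le \frac{1}{m}A(m)$, where $m = \lfloor 1/\delta \rfloor$.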


Proof. Fix any $f : \{-1,1\}^n \to \{-1,1\}$ from $\mathcal{B}$. Since noise sensitivity is an
increasing function of the noise parameter (see the discussion surrounding
Proposition 2.51) we may replace $\delta$ by $1/m$. Thus our task is to upper-bound
$\mathbf{NS}_{1/m}[f] = \Pr[f(x) \ne f(y)]$ where $x \sim \{-1,1\}^n$ is uniformly random and
$y \in \{-1,1\}^n$ is formed from $x$ by negating each bit independently with probability
$1/m$. The rough idea of the proof is that this is equivalent to randomly
partitioning $x$’s bits into $m$ parts and then negating a randomly chosen part.

More precisely, let $z \in \{-1,1\}^n$ and let $\pi : [n] \to [m]$ be a partition of $[n]$
into $m$ parts. Define

    $g_{z,\pi} : \{-1,1\}^m \to \{-1,1\}, \quad g_{z,\pi}(w) = f(z \circ w^{\pi})$,

where $\circ$ denotes entry-wise multiplication and $w^{\pi} = (w_{\pi(1)}, \dots, w_{\pi(n)}) \in \{-1,1\}^n$.
Since $g_{z,\pi}$ is derived from $f$ by negating and identifying input
variables it follows that $g_{z,\pi} \in \mathcal{B}$. So by assumption $g_{z,\pi}$ has total influence
$\mathbf{I}[g_{z,\pi}] \le A(m)$ and hence average influence $\mathcal{E}[g_{z,\pi}] \le \frac{1}{m}A(m)$ (see
Exercise 2.43(a)).

Now suppose $z \sim \{-1,1\}^n$ and $\pi : [n] \to [m]$ are chosen uniformly at
random. We certainly have

    $\mathbf{E}_{z,\pi}[\mathcal{E}[g_{z,\pi}]] \le \frac{1}{m}A(m)$.

To complete the proof we will show that the left-hand side above is precisely
$\mathbf{NS}_{1/m}[f]$. Recall that in the experiment for average influence $\mathcal{E}[g]$ we choose
$w \sim \{-1,1\}^m$ and $j \sim [m]$ uniformly at random and check if $g(w) \ne g(w^{\oplus j})$.
Thus

    $\mathbf{E}_{z,\pi}[\mathcal{E}[g_{z,\pi}]] = \Pr_{z,\pi,w,j}[g_{z,\pi}(w) \ne g_{z,\pi}(w^{\oplus j})]$
    $= \Pr_{z,\pi,w,j}[f(z \circ w^{\pi}) \ne f(z \circ (w^{\oplus j})^{\pi})]$.

It is not hard to see that the joint distribution of $z \circ w^{\pi}$, $z \circ (w^{\oplus j})^{\pi}$ is the same
as that of $x$, $y$. To be precise, define $J = \pi^{-1}(j)$, distributed as a random
subset of $[n]$ in which each coordinate is included with probability $1/m$, and
define $\lambda \in \{-1,1\}^n$ by $\lambda_i = -1$ if and only if $i \in J$. Then

    $\Pr_{z,\pi,w,j}[f(z \circ w^{\pi}) \ne f(z \circ (w^{\oplus j})^{\pi})] = \Pr_{z,\pi,w,j}[f(z \circ w^{\pi}) \ne f(z \circ w^{\pi} \circ \lambda)]$.

But for every outcome of $w$, $\pi$, $j$ (and hence $J$, $\lambda$), we may replace $z$ with
$z \circ w^{\pi}$ since they have the same distribution, namely uniform on $\{-1,1\}^n$.
Then the above becomes

    $\Pr_{z,\pi,j}[f(z) \ne f(z \circ \lambda)] = \mathbf{NS}_{1/m}[f]$,

as claimed.
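The key identity in this proof, that the expected average influence of the random function $g_{z,\pi}$ equals $\mathbf{NS}_{1/m}[f]$ exactly, can be verified by exhaustive enumeration on a toy instance. The following is a minimal sketch under stated assumptions: the example LTF `ltf`, the parameters $n = 4$ and $m = 3$, and all function names are hypothetical, and exact rational arithmetic is used so that equality can be tested without rounding.

```python
from itertools import product
from fractions import Fraction

def ltf(x):
    # Hypothetical LTF on n = 4 bits: sgn(3x1 + 2x2 + x3 + x4 - 1/2).
    weights = (3, 2, 1, 1)
    return 1 if 2 * sum(w * xi for w, xi in zip(weights, x)) - 1 > 0 else -1

def noise_sensitivity(f, n, delta):
    """Exact NS_delta[f]: x uniform, each bit flipped independently w.p. delta."""
    total = Fraction(0)
    for x in product((-1, 1), repeat=n):
        for flips in product((1, -1), repeat=n):
            k = flips.count(-1)
            y = tuple(xi * fi for xi, fi in zip(x, flips))
            if f(x) != f(y):
                total += delta ** k * (1 - delta) ** (n - k)
    return total / 2 ** n

def expected_average_influence(f, n, m):
    """E over uniform z and pi of the average influence of g(w) = f(z o w^pi)."""
    disagreements = 0
    count = 0
    for z in product((-1, 1), repeat=n):
        for pi in product(range(m), repeat=n):       # pi maps [n] -> [m]
            for w in product((-1, 1), repeat=m):
                for j in range(m):
                    # w with its j-th coordinate negated
                    w_flip = tuple(-wi if i == j else wi for i, wi in enumerate(w))
                    a = tuple(zi * w[p] for zi, p in zip(z, pi))
                    b = tuple(zi * w_flip[p] for zi, p in zip(z, pi))
                    disagreements += f(a) != f(b)
                    count += 1
    return Fraction(disagreements, count)

n, m = 4, 3
lhs = expected_average_influence(ltf, n, m)
rhs = noise_sensitivity(ltf, n, Fraction(1, m))

if __name__ == "__main__":
    print(lhs, rhs, lhs == rhs)
```

Because every $g_{z,\pi}$ here is itself an LTF on $m$ bits, each average influence is at most $\sqrt{m}/m$, which is how the theorem converts the identity into a noise sensitivity bound.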


Peres’s Theorem is now a simple corollary of Theorem 5.35. 


Proof of Peres’s Theorem. Let $\mathcal{B}$ be the class of all linear threshold functions.
This class is indeed closed under negating and identifying variables. Since each
linear threshold function on $m$ bits is unate (i.e., monotone up to negation of
some input coordinates, see Exercises 2.5, 2.6), its total influence is at most $\sqrt{m}$
(see Exercise 2.23). Applying Theorem 5.35 we get that for any LTF $f$ and any
$\delta \in (0, 1/2]$,

    $\mathbf{NS}_\delta[f] \le \frac{1}{m}\sqrt{m} = 1/\sqrt{m}$    (for $m = \lfloor 1/\delta \rfloor$)
    $\le O(\sqrt{\delta})$.


Remark 5.36. Our proof of Peres’s Theorem attains the upper bound
$\sqrt{1/\lfloor 1/\delta \rfloor}$. This is at most $\sqrt{3/2}\,\sqrt{\delta}$ for all $\delta \in (0, 1/2]$ and it’s also
$\sqrt{\delta} + O(\delta^{3/2})$ for small $\delta$. To further improve the constant we can use
Theorem 2.33 in place of Exercise 2.23; it implies that all unate $m$-bit functions have
total influence at most $\sqrt{2/\pi}\,\sqrt{m} + O(m^{-1/2})$. This lets us obtain the bound
$\mathbf{NS}_\delta[f] \le \sqrt{2/\pi}\,\sqrt{\delta} + O(\delta^{3/2})$ for all LTFs $f$.
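The explicit bound $\sqrt{1/\lfloor 1/\delta \rfloor}$ can be compared against exactly computed noise sensitivities of small LTFs. The sketch below is illustrative only; the two 7-bit LTFs and all function names are hypothetical choices, and noise sensitivity is computed by direct enumeration.

```python
from itertools import product
from math import floor, sqrt

def noise_sensitivity(f, n, delta):
    """Exact NS_delta[f]: x uniform, y flips each bit of x independently w.p. delta."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        fx = f(x)
        for flips in product((1, -1), repeat=n):
            k = flips.count(-1)
            y = tuple(a * b for a, b in zip(x, flips))
            if f(y) != fx:
                total += delta ** k * (1 - delta) ** (n - k)
    return total / 2 ** n

def peres_bound(delta):
    """The bound sqrt(1 / floor(1/delta)) attained by the proof above."""
    return sqrt(1 / floor(1 / delta))

def majority(x):
    # Maj_7 (odd input length, so no ties).
    return 1 if sum(x) > 0 else -1

def weighted_ltf(x):
    # Hypothetical unbalanced LTF; the 0.5 offset rules out ties.
    weights = (5, 3, 2, 1, 1, 1, 1)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) - 0.5 > 0 else -1

n = 7
deltas = (0.5, 0.3, 0.1, 0.05, 0.01)

if __name__ == "__main__":
    for delta in deltas:
        print(delta,
              noise_sensitivity(majority, n, delta),
              noise_sensitivity(weighted_ltf, n, delta),
              peres_bound(delta),
              sqrt(1.5 * delta))
```

In each row the two noise sensitivities should sit below the third column, which in turn should sit below $\sqrt{3/2}\,\sqrt{\delta}$ in the last column.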


Recall from Theorem 2.45 that $\mathbf{NS}_\delta[\mathrm{Maj}_n] \sim \frac{2}{\pi}\sqrt{\delta}$ for large $n$. Thus the
constant $\sqrt{2/\pi}$ in the bound from Remark 5.36 is fairly close to optimal.
It seems quite likely that majority’s $\frac{2}{\pi}$ is the correct constant here. There
is still slack in Peres’s proof because the random functions $g_{z,\pi}$ arising in
Theorem 5.35 are unlikely to be majorities, even if $f$ is. The most elegant
possible result in this direction would be to prove the following conjecture of
Benjamini, Kalai, and Schramm:

Majority Is Least Stable Conjecture. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a linear
threshold function, $n$ odd. Then for all $\rho \in [0, 1]$, $\mathbf{Stab}_\rho[f] \ge \mathbf{Stab}_\rho[\mathrm{Maj}_n]$.


(This is a precise statement about majority’s noise stability within the class of 
LTFs; the Majority Is Stablest Theorem refers to its noise stability within the 
class of small-influence functions.) 

A challenging problem in this area is to extend Peres’s Theorem to polynomial
threshold functions. Let

    $\mathcal{P}_{n,k} = \{f : \{-1,1\}^n \to \{-1,1\} \mid f \text{ is a PTF of degree at most } k\}$,
    $\mathcal{P}_k = \bigcup_n \mathcal{P}_{n,k}$.

Peres’s Theorem shows that the class $\mathcal{P}_1$ (i.e., LTFs) is uniformly noise-stable.
Is the same true of $\mathcal{P}_2$? What about $\mathcal{P}_{100}$? More quantitatively, what upper
bound can we prove on $\mathbf{NS}_\delta[f]$ for $f \in \mathcal{P}_k$? Since $\mathcal{P}_k$ is closed under negating
and identifying variables, a natural approach to bounding the noise sensitivity
of PTFs is to again use Theorem 5.35. For example, if we could show that
$\mathbf{I}[f] = o(n)$ for all $f \in \mathcal{P}_{n,k}$ we could conclude that $\mathbf{NS}_\delta[f] = o_\delta(1)$ for all
$f \in \mathcal{P}_k$, i.e., that $\mathcal{P}_k$ is uniformly noise-stable. (In fact, the total influence
approach to bounding noise sensitivity is not just sufficient but is also necessary;
see Exercise 5.40.) More ambitiously, if we could show that $\mathbf{I}[f] \le O_k(1)\sqrt{n}$
for all $f \in \mathcal{P}_{n,k}$ then it would follow that $\mathbf{NS}_\delta[f] \le O_k(1)\sqrt{\delta}$ for all $f \in \mathcal{P}_k$,
strictly generalizing Peres’s Theorem. In fact, a conjecture of Gotsman and
Linial dating back to 1990 proposes an even more refined bound:

Gotsman–Linial Conjecture. Let $f \in \mathcal{P}_{n,k}$. Then $\mathbf{I}[f] \le O_k(1) \cdot \sqrt{n}$. More
strongly, $\mathbf{I}[f] \le O(k) \cdot \sqrt{n}$. Most strongly, the $f \in \mathcal{P}_{n,k}$ of maximal total
influence is the symmetric one $f(x) = \mathrm{sgn}(p(x_1 + \cdots + x_n))$, where $p$ is a
degree-$k$ univariate polynomial which alternates sign on the $k + 1$ values of
$x_1 + \cdots + x_n$ closest to 0.


The strongest form of the Gotsman–Linial Conjecture is true when $k = 1$, by
Theorem 2.33. However, even for $k = 2$ there was no progress on the conjecture
for close to 20 years. At that point two independent works (Diakonikolas
et al., 2010; Harsha et al., 2010) showed that every $f \in \mathcal{P}_{n,k}$ satisfies both
$\mathbf{I}[f] \le O(n^{1-1/2^k})$ and $\mathbf{I}[f] \le 2^{O(k)} n^{1-1/O(k)}$. The former (essentially weaker)
bound has the advantage of an elementary proof; see Exercise 5.45. It also
suffices to show that $\mathcal{P}_k$, the class of degree-$k$ PTFs, is indeed uniformly
noise-stable. This gives a nice kind of converse to Proposition 5.6, which showed that
every function in a uniformly noise-stable class is close to being a
constant-degree PTF.

The latest progress on the Gotsman—Linial Conjecture is the following the- 
orem of Kane (Kane, 2012), which comes quite close to proving it: 


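Theorem 5.37. Every $f \in \mathcal{P}_{n,k}$ satisfies $\mathbf{I}[f] \le \sqrt{n} \cdot (2\log n)^{O(k \log k)}$. It follows
(via Theorem 5.35) that for a fixed $k \in \mathbb{N}^+$, every $f \in \mathcal{P}_k$ satisfies
$\mathbf{NS}_\delta[f] \le \sqrt{\delta} \cdot \mathrm{polylog}(1/\delta)$.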


5.6. Exercises and Notes 


5.1 (a) Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is an LTF. Show that it can be
expressed as $f(x) = \mathrm{sgn}(a_0 + a_1x_1 + \cdots + a_nx_n)$ where the $a_i$’s are
integers. (Hint: First obtain rational $a_i$’s by a perturbation.)

(b) Show also that a degree-$d$ PTF has a representation in which all of
the degree-$d$ polynomial’s coefficients are integers.

5.2 Let $f(x) = \mathrm{sgn}(a_0 + a_1x_1 + \cdots + a_nx_n)$ be an LTF.

(a) Show that if $a_0 = 0$, then $\mathbf{E}[f] = 0$. (Hint: Show that $f$ is in fact an
odd function.)

(b) Show that if $a_0 \ge 0$, then $\mathbf{E}[f] \ge 0$. Show that the converse need not
hold.

(c) Suppose $g : \{-1,1\}^n \to \{-1,1\}$ is an LTF with $\mathbf{E}[g] = 0$. Show that
$g$ can be represented as $g(x) = \mathrm{sgn}(c_1x_1 + \cdots + c_nx_n)$.

5.3 Suppose $f(x) = \mathrm{sgn}(a_0 + a_1x_1 + \cdots + a_nx_n)$ is an LTF with
$|a_1| \ge |a_2| \ge \cdots \ge |a_n|$. Show that $\mathbf{Inf}_1[f] \ge \mathbf{Inf}_2[f] \ge \cdots \ge \mathbf{Inf}_n[f]$. (Hint: Why
does it suffice to prove this for $n = 2$?)

5.4 (a) Show that the number of functions $f : \{-1,1\}^n \to \{-1,1\}$ that are
LTFs is at most $2^{n^2 + O(n)}$. (Hint: Chow’s Theorem.)
(b) More generally, show that the number of functions $f : \{-1,1\}^n \to \{-1,1\}$
that are degree-$k$ PTFs is at most $2^{n^{k+1} + O(n)}$.
5.5 (a) Suppose $\ell : \{-1,1\}^n \to \mathbb{R}$ is defined by $\ell(x) = a_0 + a_1x_1 + \cdots + a_nx_n$.
Define $\tilde{\ell} : \{-1,1\}^{n+1} \to \mathbb{R}$ by $\tilde{\ell}(x_0, \dots, x_n) = a_0x_0 + a_1x_1 + \cdots + a_nx_n$.
Show that $\|\tilde{\ell}\|_1 = \|\ell\|_1$ and $\|\tilde{\ell}\|_2^2 = \|\ell\|_2^2$.
(b) Complete the proof of Theorem 5.2.
5.6 Let $f : \{-1,1\}^n \to \{-1,1\}$ be an unbiased linear threshold function.
Show that $\mathbf{Inf}_i[f] \ge \Omega(1/\sqrt{n})$ for some $i \in [n]$, improving the KKL Theorem
for LTFs.


5.7 Consider the following “correlation distillation” problem (cf. Exercise 2.56).
For each $i \in [n]$ there is a number $\rho_i \in [-1, 1]$ and an independent
sequence of pairs of $\rho_i$-correlated bits, $(a_i^{(1)}, b_i^{(1)})$, $(a_i^{(2)}, b_i^{(2)})$,
$(a_i^{(3)}, b_i^{(3)})$, etc. Party A on Earth has access to the stream of bits $a^{(1)}, a^{(2)}, a^{(3)}, \dots$
and a party B on Venus has access to the stream $b^{(1)}, b^{(2)}, b^{(3)}, \dots$.
Neither party knows the numbers $\rho_1, \dots, \rho_n$. The goal is for B to estimate
these correlations. To assist in this, A can send a small number of bits to B.
A reasonable strategy is for A to send $f(a^{(1)}), f(a^{(2)}), f(a^{(3)}), \dots$ to B,
where $f : \{-1,1\}^n \to \{-1,1\}$ is some Boolean function. Using this
information B can try to estimate $\mathbf{E}[f(a)b_i]$ for each $i$.

(a) Show that $\mathbf{E}[f(a)b_i] = \hat{f}(i)\rho_i$.

(b) This motivates choosing an $f$ for which all $\hat{f}(i)$ are large. If we also
insist all $\hat{f}(i)$ be equal, show that majority functions $f$ maximize this
common value.

5.8 For $n \ge 2$, let $f : \{-1,1\}^n \to \{-1,1\}$ be a randomly chosen function (as
in Exercise 1.7). Show that $\|\hat{f}\|_\infty \le 2\sqrt{n}\,2^{-n/2}$ except with probability
at most $2^{-n}$.

5.9 Prove Theorem 5.8.

5.10 (a) Give as simple a proof as you can that the parity function
$\chi_{[n]} : \{-1,1\}^n \to \{-1,1\}$ is not a PTF of degree $n - 1$.

(b) Show that if $f : \{-1,1\}^n \to \{-1,1\}$ is not $\pm\chi_{[n]}$, then it is a PTF of
degree $n - 1$. (Hint: Consider $f^{\le n-1}$.)

5.11 For each $k \in \mathbb{N}^+$, show that there is a degree-$k$ PTF $f$ with
$\mathbf{W}^{\le k}[f] < 2^{1-k}$.

5.12 In this exercise you will show that threshold-of-parities circuits can be
effectively simulated by threshold-of-thresholds circuits, but not the converse.

(a) Let $f : \{-1,1\}^n \to \{-1,1\}$ be a symmetric function. Show that $f$ is
computable as the sum of at most $2n$ LTFs, plus a constant.

(b) Deduce that if $f : \{-1,1\}^n \to \{-1,1\}$ is computable by a size-$s$
threshold-of-parities circuit, then it is also computable by a size-$2ns$
threshold-of-thresholds circuit.

(c) Show that the complete quadratic function $\mathrm{CQ}_n : \mathbb{F}_2^n \to \{-1,1\}$ (see
Exercise 1.1) is computable by a size-$2n$ threshold-of-thresholds
circuit.

(d) Assume $n$ even. Show that any threshold-of-parities circuit for $\mathrm{CQ}_n$
requires size $2^{n/2}$.

5.13 Let $f : \{-1,1\}^n \to \{-1,1\}$ be computable by a DNF of size $s$. Show
that $f$ has a PTF representation of sparsity $O(ns^2)$. (Hint: Approximate
the ANDs using Theorem 5.12.)

5.14 In contrast to the previous exercise, show that there is a function
$f : \{-1,1\}^n \to \{-1,1\}$ computable by a depth-3 $\mathrm{AC}^0$ circuit (see Chapter 4.5)
but requiring threshold-of-parities circuits of size at least $n^{\log n}$.
(Hint: Involve the inner product mod 2 function and Exercise 4.12.)

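5.15 Let $\mathcal{F}$ be a nonempty collection of subsets $S \subseteq [n]$. For each
$a \in \{-1,1\}^n$, write $1_{\{a\}} : \{-1,1\}^n \to \{0,1\}$ for the indicator of $\{a\}$, write
$1^{\mathcal{F}}_{\{a\}} : \{-1,1\}^n \to \mathbb{R}$ for $\sum_{S \in \mathcal{F}} \widehat{1_{\{a\}}}(S)\,\chi_S$, and write
$\psi_a = \frac{2^n}{|\mathcal{F}|} 1^{\mathcal{F}}_{\{a\}}$.

(a) Show that $\psi_a(a) = 1$ and $\mathbf{E}[\psi_a^2] = \frac{1}{|\mathcal{F}|}$. Show also that for all
$x \in \{-1,1\}^n$, $\psi_a(x) = \psi_x(a)$ and $\sum_{a \in \{-1,1\}^n} \psi_a(x)^2 = \frac{2^n}{|\mathcal{F}|}$.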

(b) Fix O0<e<1 and suppose that |F| > (1 — Sya, Let f: 
$\{-1,1\}^n \to \{-1,1\}$ be a random function as in Exercise 1.7. Show
that for each $x \in \{-1,1\}^n$, except with probability at most $4^{-n}$ it
holds that $|\sum_{a \ne x} f(a)\psi_a(x)| < \epsilon$.

(c) Deduce that for all but a $2^{-n}$ fraction of functions $f : \{-1,1\}^n \to \{-1,1\}$,
there is a multilinear polynomial $q : \{-1,1\}^n \to \mathbb{R}$ supported
on the monomials $\{x^S : S \in \mathcal{F}\}$ such that $\|f - q\|_\infty < \epsilon$.

(d) Deduce that all but a $2^{-n}$ fraction of functions $f : \{-1,1\}^n \to \{-1,1\}$
have a PTF representation of degree at most $n/2 + O(\sqrt{n \log n})$.


5.16 (a) Show that in the Berry–Esseen Theorem we can also conclude

    $|\Pr[S < u] - \Pr[Z < u]| \le c\gamma$.

(Hint: You’ll need that $\lim_{\delta \to 0^+} \Pr[Z \le u - \delta] = \Pr[Z < u]$.)
(b) Deduce that if $I \subseteq \mathbb{R}$ is any interval, we can also conclude

    $|\Pr[S \in I] - \Pr[Z \in I]| \le 2c\gamma$.

5.17 Show that the assumptions $\mathbf{E}[X_i] = 0$ and $\sum_{i=1}^n \mathbf{Var}[X_i] = 1$ in the
Berry–Esseen Theorem are not restrictive, as follows. Let $X_1, \dots, X_n$
be independent random variables with finite means and variances.
Let $S = \sum_{i=1}^n X_i$ and let $Z \sim \mathrm{N}(\mu, \sigma^2)$, where $\mu = \sum_{i=1}^n \mathbf{E}[X_i]$ and
$\sigma^2 = \sum_{i=1}^n \mathbf{Var}[X_i]$. Assuming $\sigma^2 > 0$, show that for all $u \in \mathbb{R}$,

    $|\Pr[S \le u] - \Pr[Z \le u]| \le c\gamma/\sigma^3$,

where

    $\gamma = \sum_{i=1}^n \|X_i - \mathbf{E}[X_i]\|_3^3$.
5.18 (a) Use the generalized Binomial Theorem to compute the power series
for $(1 - z^2)^{-1/2}$, valid for $|z| < 1$.
(b) Integrate to obtain the power series for $\arcsin z$ given in (5.9), valid
for $|z| < 1$.
(c) Confirm that equality holds also for $z = \pm 1$.
5.19 Verify that the random vector $S$ defined in (5.7) has $\mathbf{E}[S_1] = \mathbf{E}[S_2] = 0$,
$\mathbf{E}[S_1^2] = \mathbf{E}[S_2^2] = 1$, and $\mathbf{E}[S_1S_2] = \rho$; i.e.,

    $\mathbf{E}[S] = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \mathbf{Cov}[S] = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$.

5.20 Prove Corollary 5.20. 

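5.21 Fix $n$ odd. Using Theorem 5.19 show that $|\widehat{\mathrm{Maj}_n}(S)|$ is a decreasing
function of $|S|$ for odd $1 \le |S| \le \frac{n-1}{2}$. Deduce (using also Corollary 5.20)
that $\|\widehat{\mathrm{Maj}_n}\|_\infty = \widehat{\mathrm{Maj}_n}(\{1\}) \sim \frac{\sqrt{2/\pi}}{\sqrt{n}}$.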

5.22 Prove Corollary 5.21. 

5.23 Prove Theorem 5.18. (Hint: Corollary 5.21.) 

5.24 Complete the proof of Theorem 5.22 by showing that (1 — H +4" 2 
<1+42k/n forall 1 < k <n/2. 

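5.25 Using just the facts that $\mathbf{Stab}_\rho[\mathrm{Maj}_n] \to \frac{2}{\pi}\arcsin\rho$ for all
$\rho \in [-1, 1]$ and that $\mathbf{Stab}_\rho[\mathrm{Maj}_n] = \sum_{k \ge 0} \mathbf{W}^{k}[\mathrm{Maj}_n]\rho^k$, deduce that
$\lim_{n \to \infty} \mathbf{W}^{k}[\mathrm{Maj}_n] = [\rho^k](\frac{2}{\pi}\arcsin\rho)$ for all $k \in \mathbb{N}$, where $[\rho^k]$ denotes
the coefficient of $\rho^k$ in the power series. (Hint: By induction
on $k$, always taking $\rho$ “small enough”.)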


5.26 (a) For O< j <m integers, show that IMajz iT h = ic oat : 
2m+1 e”) 
2m \m?' 
(b) Deduce that ÎMajn,iÎi = E [za] et"), where X~ 
Binomial(m, 1/2). 


(c) Deduce that ÎMaj, Îi ~ an.
5.27 (a) Show that for each odd $k \in \mathbb{N}$,

    $(\frac{2}{\pi})^{3/2} k^{-3/2} \le [\rho^k](\frac{2}{\pi}\arcsin\rho) \le (\frac{2}{\pi})^{3/2} k^{-3/2}(1 + O(1/k))$.

(Hint: Stirling’s approximation.)

(b) Prove Corollary 5.23. (Hint: For the second statement you’ll need to
approximate the sum $\sum_{\text{odd } j > k} (\frac{2}{\pi})^{3/2} j^{-3/2}$ by an integral.)

5.28 For integer $0 \le j \le n$, define $\mathcal{K}_j : \{-1,1\}^n \to \mathbb{R}$ by
$\mathcal{K}_j(x) = \sum_{|S| = j} x^S$. Since $\mathcal{K}_j$ is symmetric, the value $\mathcal{K}_j(x)$ depends only on
the number $z$ of $-1$’s in $x$; or equivalently, on $\sum_{i=1}^n x_i$. Thus we
may define $K_j : \{0, 1, \dots, n\} \to \mathbb{R}$ by $K_j(z) = \mathcal{K}_j(x)$ for any $x$ with
$\sum_i x_i = n - 2z$.

(a) Show that $K_j(z)$ can be expressed as a degree-$j$ polynomial in $z$. It
is called the Kravchuk (or Krawtchouk) polynomial of degree $j$. (The
dependence on $n$ is usually implicit.)

(c) Show for $\rho \in [-1, 1]$ that $\sum_{j=0}^n \mathcal{K}_j(x)\rho^j = 2^n \Pr[y = (1, \dots, 1)]$,
where $y \sim N_\rho(x)$.

(d) Deduce the generating function identity
$K_j(z) = [\rho^j]\bigl((1 - \rho)^z (1 + \rho)^{n-z}\bigr)$.
5.29 Prove Proposition 5.24. 


5.30 Prove Proposition 5.25 using the Central Limit Theorem. (Hint for
$\mathbf{W}^{1}[f_n]$: use symmetry to show it equals the square of $\mathbf{E}[f_n(x)\sum_{i=1}^n \frac{1}{\sqrt{n}}x_i]$.)


5.31 Consider the setting of Theorem 5.16. Let $S = \sum_i a_ix_i$ where
$x \sim \{-1,1\}^n$, and let $Z \sim \mathrm{N}(0, 1)$.
(a) Show that $\Pr[|S| \ge t], \Pr[|Z| \ge t] \le 2\exp(-t^2/2)$ for all $t \ge 0$.
(b) Recalling $\mathbf{E}[|Y|] = \int_0^\infty \Pr[|Y| \ge t]\,dt$ for any random variable $Y$,
use the Berry–Esseen Theorem (and Remark 5.15, Exercise 5.16) to
show

    $|\mathbf{E}[|S|] - \mathbf{E}[|Z|]| \le O(\epsilon T + \exp(-T^2/2))$

for any $T \ge 1$.

(c) Deduce $|\mathbf{E}[|S|] - \sqrt{2/\pi}| \le O(\epsilon\sqrt{\log(1/\epsilon)})$.

(d) Improve $O(\epsilon\sqrt{\log(1/\epsilon)})$ to the bound $O(\epsilon)$ stated in Theorem 5.16
by using the nonuniform Berry–Esseen Theorem, which states that
the bound $c\gamma$ in the Berry–Esseen Theorem can be improved to
$C\gamma \cdot \frac{1}{1 + |u|^3}$ for some constant $C$.

5.32 Consider the sequence of LTFs defined in Proposition 5.25. Show that

    $\lim_{n \to \infty} \mathbf{Stab}_\rho[f_n] = \Lambda_\rho(\mu)$.

Here $\mu = \overline{\Phi}(t)$ and $\Lambda_\rho(\mu)$ is the Gaussian quadrant probability defined
by $\Lambda_\rho(\mu) = \Pr[z_1 > t, z_2 > t]$, where $z_1$, $z_2$ are standard Gaussians with
correlation $\mathbf{E}[z_1z_2] = \rho$. Verify also that $\Lambda_\rho(\alpha) = \Pr[z_1 \le t, z_2 \le t]$
where $\alpha = \Phi(t)$.

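5.33 In this exercise you will complete the justification of Theorem 5.17 using
the following multidimensional Berry–Esseen Theorem:

Theorem 5.38. Let $X_1, \dots, X_n$ be independent $\mathbb{R}^d$-valued random vectors,
each having mean zero. Write $S = \sum_{i=1}^n X_i$ and assume $\Sigma = \mathbf{Cov}[S]$ is
invertible. Let $Z \sim \mathrm{N}(0, \Sigma)$ be a $d$-dimensional Gaussian with
the same mean and covariance matrix as $S$. Then for all convex sets
$U \subseteq \mathbb{R}^d$,

    $|\Pr[S \in U] - \Pr[Z \in U]| \le Cd^{1/4}\gamma$,

where $C$ is a universal constant, $\gamma = \sum_{i=1}^n \mathbf{E}[\|\Sigma^{-1/2}X_i\|_2^3]$, and $\|\cdot\|_2$
denotes the Euclidean norm on $\mathbb{R}^d$.

(a) Let $\Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$ where $\rho \in (-1, 1)$. Show that

    $\Sigma^{-1} = \begin{bmatrix} 1 & -\rho \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & (1 - \rho^2)^{-1} \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -\rho & 1 \end{bmatrix}$.

(b) Compute $y^\top \Sigma^{-1} y$ for $y = \begin{bmatrix} \pm a \\ \pm a \end{bmatrix} \in \mathbb{R}^2$.

(c) Complete the proof of Theorem 5.17.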

5.34 Let $\mathcal{B}$ be a class of Boolean-valued functions, all of input length at
most $n$. Show that $\mathbf{NS}_\delta[f] \le n\delta$ for all $f \in \mathcal{B}$ and hence $\mathcal{B}$ is uniformly
noise-stable (in a sense, vacuously). (Hint: Exercise 2.42.)

5.35 Give a simple proof of the following fact, which is a robust form
of the edge-isoperimetric inequality (for volume 1/2) and a weak
form of the FKN Theorem: If $f : \{-1,1\}^n \to \{-1,1\}$ has $\mathbf{E}[f] = 0$
and $\mathbf{I}[f] \le 1 + \delta$, then $f$ is $O(\delta)$-close to $\pm\chi_i$ for some $i \in [n]$. In
fact, you should be able to achieve $\delta$-closeness (which can be further
improved using Theorem 5.33). (Hint: Upper- and lower-bound
$\sum_i \hat{f}(i)^2 \le (\max_i |\hat{f}(i)|)(\sum_i |\hat{f}(i)|)$ using Proposition 3.2 and
Exercise 2.5(a).)

5.36 Show that Theorem 5.33 is essentially optimal by exhibiting
functions $f : \{-1,1\}^n \to \{-1,1\}$ with both $\hat{f}(1) = 1 - \delta/2$ and
$\mathbf{W}^{1}[f] \ge 1 - \delta + \Omega(\delta^2 \log(1/\delta))$, for a sequence of $\delta$ tending to 0.

5.37 Prove Corollary 5.32.

5.38 Fill in the details of the proof of Theorem 5.33.

5.39 Show that if $f : \{-1,1\}^n \to \{-1,1\}$ is an LTF, then
$\frac{d}{d\delta}\mathbf{NS}_\delta[f] \le O(1/\sqrt{\delta})$. (Hint: The only fact needed about LTFs is the corollary of
Peres’s Theorem that $\mathbf{W}^{\ge k}[f] \le O(1/\sqrt{k})$ for all $k$.)


5.40 As discussed in Section 5.5, Theorem 5.35 implies that an upper bound
on the total influence of degree-$k$ PTFs is sufficient to derive an upper
bound on their noise sensitivity. This exercise asks you to show necessity
as well. More precisely, suppose $\mathbf{NS}_\delta[f] \le \epsilon(\delta)$ for all $f \in \mathcal{P}_k$. Show
that $\mathbf{I}[f] \le O(\epsilon(1/n) \cdot n)$ for all $f \in \mathcal{P}_{n,k}$. Deduce that $\mathcal{P}_k$ is uniformly
noise-stable if and only if $\mathbf{I}[f] = o(n)$ for all $f \in \mathcal{P}_{n,k}$, and that
$\mathbf{NS}_\delta[f] \le O(k\sqrt{\delta})$ for all $f \in \mathcal{P}_k$ if and only if $\mathbf{I}[f] \le O(k\sqrt{n})$ for all $f \in \mathcal{P}_{n,k}$.
(Hint: Exercise 2.43(a).)

5.41 Estimate carefully the asymptotics of $\mathbf{I}[f]$, where $f \in \mathrm{PTF}_{n,k}$ is as in the
strongest form of the Gotsman–Linial Conjecture.

5.42 Let $A \subseteq \{-1,1\}^n$ have cardinality $\alpha 2^n$, $\alpha \le 1/2$. Thinking of
$\{-1,1\}^n \subset \mathbb{R}^n$, let $\mu_A \in \mathbb{R}^n$ be the center of mass of $A$. Show that $\mu_A$ is
close to the origin in Euclidean distance: $\|\mu_A\|_2 \le O(\sqrt{\log(1/\alpha)})$.

5.43 Show that the Gaussian isoperimetric function satisfies $\mathcal{U}'' = -1/\mathcal{U}$ on
$(0, 1)$. Deduce that $\mathcal{U}$ is concave.

5.44 Fix $\alpha \in (0, 1/2)$. Let $f : \{-1,1\}^n \to [-1,1]$ satisfy $\mathbf{E}[|f|] \le \alpha$ and
$|\hat{f}(i)| \le \epsilon$ for all $i \in [n]$. Show that $\mathbf{W}^{1}[f] \le \mathcal{U}(\alpha)^2 + C\epsilon$, where $\mathcal{U}$ is the
Gaussian isoperimetric function and where the constant $C$ may depend
on $\alpha$. (Hint: You will need the nonuniform Berry–Esseen Theorem from
Exercise 5.31.)

5.45 In this exercise you will show by induction on $k$ that $\mathbf{I}[f] \le 2n^{1-1/2^k}$
for all degree-$k$ PTFs $f : \{-1,1\}^n \to \{-1,1\}$. The $k = 0$ case is trivial.
So for $k > 0$, suppose $f = \mathrm{sgn}(p)$ where $p : \{-1,1\}^n \to \mathbb{R}$ is a degree-$k$
polynomial that is never 0.

(a) Show for $i \in [n]$ that $\mathbf{E}[f(x)x_i\,\mathrm{sgn}(D_ip(x))] = \mathbf{Inf}_i[f]$. (Hint:
First use the decomposition $f = x_iD_if + E_if$ to reach $\mathbf{E}[D_if \cdot \mathrm{sgn}(D_ip)]$;
then show that $D_if = \mathrm{sgn}(D_ip)$ whenever $D_if \ne 0$.)

(b) Conclude that $\mathbf{I}[f] \le \mathbf{E}[|\sum_i x_i\,\mathrm{sgn}(D_ip(x))|]$. Remark: When $k = 2$
and thus each $\mathrm{sgn}(D_ip)$ is an LTF, it is conjectured that this bound is
still $O(\sqrt{n})$.

(c) Apply Cauchy–Schwarz and deduce

    $\mathbf{I}[f] \le \sqrt{n + \sum_{i \ne j} \mathbf{E}[x_ix_j\,\mathrm{sgn}(D_ip(x))\,\mathrm{sgn}(D_jp(x))]}$.

(d) Use Exercise 2.19 and the AM-GM inequality to obtain
$\mathbf{I}[f] \le \sqrt{n + \sum_i \mathbf{I}[\mathrm{sgn}(D_ip)]}$.

(e) Complete the induction.

(f) Finally, deduce that the class of degree-$k$ PTFs is uniformly
noise-stable, specifically, that every degree-$k$ PTF $f$ satisfies
$\mathbf{NS}_\delta[f] \le 3\delta^{1/2^k}$ for all $\delta \in (0, 1/2]$. (Hint: Theorem 5.35.)


Notes 


Chow’s Theorem was proved independently by Chow (Chow, 1961) and by
Tannenbaum (Tannenbaum, 1961) in 1961; see also Elgot (Elgot, 1961). The generalization
to PTFs (Theorem 5.8) is due to Bruck (Bruck, 1990), as is Theorem 5.10 and
Exercise 5.12. Theorems 5.2 and 5.9 are from Gotsman and Linial (Gotsman and Linial,
1994) and may be called the Gotsman–Linial Theorems; this work also contains the
Gotsman–Linial Conjecture and Exercise 5.11. Conjecture 5.3 should be considered
folklore. Corollary 5.13 was proved by Bruck and Smolensky (Bruck and Smolensky,
1992); they also essentially proved Theorem 5.12 (but see (Siu and Bruck, 1991)).
Exercise 5.13 is usually credited to Krause and Pudlák (Krause and Pudlák, 1997). The
upper bound in Exercise 5.4 is asymptotically sharp (Zuev, 1989). Exercise 5.15 is from
O’Donnell and Servedio (O’Donnell and Servedio, 2008).

Theorem 2.33 and Proposition 2.58, discussed in Section 5.2, were essentially
proved by Titsworth in 1962 (Titsworth, 1962); see also (Titsworth, 1963). More
precisely, Titsworth solved a version of the problem from Exercise 5.7. His motivation
was in fact the construction of “interplanetary ranging systems” for measuring deep
space distances, e.g., the distance from Earth to Venus. The connection between
ranging systems and Boolean functions was suggested by his advisor, Solomon Golomb.
Titsworth (Titsworth, 1962) was also the first to compute the Fourier expansion of
$\mathrm{Maj}_n$. His approach involved generating functions and contour integration. Other
approaches have used special properties of binomial coefficients (Brandman, 1987)
or of Kravchuk polynomials (Kalai, 2002). The asymptotics of $\mathbf{W}^{k}[\mathrm{Maj}_n]$ described
in Section 5.3 may have first appeared in Kalai (Kalai, 2002), with the error bounds
being from O’Donnell (O’Donnell, 2003). Kravchuk polynomials were introduced by
Kravchuk (Kravchuk, 1929).

The Berry–Esseen Theorem is due independently to Berry (Berry, 1941) and
Esseen (Esseen, 1942). Shevtsova (Shevtsova, 2013) has the record for the smallest
known constant B that works therein: roughly .5514. The nonuniform version described
in Exercise 5.31 is due to Bikelis (Bikelis, 1966). The multidimensional version
Theorem 5.38 stated in Exercise 5.33 is due to Bentkus (Bentkus, 2004). Sheppard proved
his formula in 1899 (Sheppard, 1899). The results of Theorem 5.18 may have appeared
first in O’Donnell (O’Donnell, 2004, 2003).

The Level-1 Inequality should probably be considered folklore; it was perhaps first 
published in Talagrand (Talagrand, 1996) and we have followed his proof. The first 
half of the $\frac{2}{\pi}$ Theorem is from Khot et al. (Khot et al., 2007); the second half is from
Matulef et al. (Matulef et al., 2010). Theorem 5.33, which improves the FKN Theorem
to achieve “closeness” $\delta/4$, was independently obtained by Jendrej, Oleszkiewicz,
and Wojtaszczyk (Jendrej et al., 2012), as was Exercise 5.36 showing optimality of
this closeness. The closeness achieved in the original proof of the FKN Theorem
(Friedgut et al., 2002) was $\delta/2$; that proof (like ours) relies on having a separate proof of
closeness $O(\delta)$. Kindler and Safra (Kindler and Safra, 2002; Kindler, 2002) gave a
self-contained proof of the $\delta/2$ bound relying only on the Hoeffding bound. The content of
Exercise 5.35 was communicated to the author by Eric Blais. The result of Exercise 5.44 
is from (Khot et al., 2007); Exercise 5.42 was suggested by Rocco Servedio. 

Peres’s Theorem was published in 2004 (Peres, 2004) but was mentioned as early as 
1999 by Benjamini, Kalai, and Schramm (Benjamini et al., 1999). The work (Benjamini 
et al., 1999) introduced the definition of uniform noise stability and showed that the 
class of all LTFs satisfies it; however, their upper bound on the noise sensitivity of 
LTFs was $O(\delta^{1/4})$, worse than Peres’s. The proof of Peres’s Theorem that we presented
is a simplification due to Parikshit Gopalan and incorporates an idea of Diakonikolas 
et al. (Diakonikolas et al., 2010; Harsha et al., 2010). Regarding the total influence of 
PTFs, the work of Kane (Kane, 2012) shows that every degree-k PTF on n variables has
$\mathbf{I}[f] \le \mathrm{poly}(k)\, n^{1 - 1/O(k)}$, which is better than Theorem 5.37 for certain superconstant
values of k. Exercise 5.39 was suggested by Nitin Saurabh. 


6 


Pseudorandomness and F2-Polynomials 


In this chapter we discuss various notions of pseudorandomness for Boolean 
functions; by this we mean properties of a fixed Boolean function that are 
in some way characteristic of randomly chosen functions. We will see some 
deterministic constructions of pseudorandom probability density functions with 
small support; these have algorithmic application in the field of derandomiza- 
tion. Finally, several of the results in the chapter will involve interplay between 
the representation of $f : \{0,1\}^n \to \{0,1\}$ as a polynomial over the reals and
its representation as a polynomial over $\mathbb{F}_2$.


6.1. Notions of Pseudorandomness 


The most obvious spectral property of a truly random function $f : \{-1,1\}^n \to \{-1,1\}$
is that all of its Fourier coefficients are very small (as we saw in
Exercise 5.8). Let’s switch notation to $f : \{-1,1\}^n \to \{0,1\}$; in this case $\hat{f}(\emptyset)$
will not be very small but rather very close to 1/2. Generalizing:


Proposition 6.1. Let $n > 1$ and let $f : \{-1,1\}^n \to \{0,1\}$ be a p-biased random
function; i.e., each $f(x)$ is 1 with probability $p$ and 0 with probability
$1 - p$, independently for all $x \in \{-1,1\}^n$. Then except with probability at
most $2^{-n}$, all of the following hold:

\[ |\hat{f}(\emptyset) - p| \le 2\sqrt{n}\, 2^{-n/2}, \qquad \forall S \ne \emptyset \quad |\hat{f}(S)| \le 2\sqrt{n}\, 2^{-n/2}. \]


Proof. We have $\hat{f}(S) = \sum_x \frac{1}{2^n} x^S f(x)$, where the random variables $f(x)$ are
independent. If $S = \emptyset$, then the coefficients $\frac{1}{2^n} x^S$ sum to 1 and the mean of $\hat{f}(S)$
is $p$; otherwise the coefficients sum to 0 and the mean of $\hat{f}(S)$ is 0. Either way
we may apply the Hoeffding bound to conclude that

\[ \Pr\bigl[\,|\hat{f}(S) - \mathbf{E}[\hat{f}(S)]| \ge t\,\bigr] \le 2\exp(-t^2 \cdot 2^{n-1}) \]

for any $t > 0$. Selecting $t = 2\sqrt{n}\, 2^{-n/2}$, the above bound is $2\exp(-2n) \le 4^{-n}$.
The result follows by taking a union bound over all $S \subseteq [n]$.
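As a numerical sanity check of Proposition 6.1, the following is a minimal sketch that samples one p-biased function and tests both inequalities. The helper names, the bitmask encoding of inputs and sets, and the choices n = 10, p = 0.3, and seed 0 are assumptions made for this illustration only.

```python
import math
import random


def fourier_coefficients(values):
    """Return the list of Fourier coefficients via the fast Walsh-Hadamard transform.

    values[x] is f at the point whose i-th coordinate is (-1) ** ((x >> i) & 1);
    a set S is encoded as a bitmask in the same way.
    """
    coeffs = list(values)
    h = 1
    while h < len(coeffs):
        for start in range(0, len(coeffs), 2 * h):
            for j in range(start, start + h):
                u, v = coeffs[j], coeffs[j + h]
                coeffs[j], coeffs[j + h] = u + v, u - v
        h *= 2
    return [c / len(coeffs) for c in coeffs]


def random_biased_function(n, p, rng):
    """Truth table of a p-biased random function {-1,1}^n -> {0,1}."""
    return [1 if rng.random() < p else 0 for _ in range(2 ** n)]


def within_proposition_bounds(values, n, p):
    """Check both inequalities of Proposition 6.1 for one sampled function."""
    bound = 2 * math.sqrt(n) * 2 ** (-n / 2)
    coeffs = fourier_coefficients(values)
    return abs(coeffs[0] - p) <= bound and all(abs(c) <= bound for c in coeffs[1:])


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n, p = 10, 0.3
table = random_biased_function(n, p, rng)
print(within_proposition_bounds(table, n, p))
```

The proposition only promises the bounds with probability at least $1 - 2^{-n}$, so a single sample is evidence rather than proof; rerunning with other seeds gives a rough feel for how much slack the Hoeffding bound leaves.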


This proposition motivates the following basic notion of “pseudorandom- 
ness”: 


Definition 6.2. A function $f : \{-1,1\}^n \to \mathbb{R}$ is $\epsilon$-regular (sometimes called
$\epsilon$-uniform) if $|\hat{f}(S)| \le \epsilon$ for all $S \ne \emptyset$.


Remark 6.3. By Exercise 3.9, every function $f$ is $\epsilon$-regular for $\epsilon = \|f\|_1$. We
are often concerned with $f : \{-1,1\}^n \to [-1,1]$, in which case we focus on
$\epsilon \le 1$.


Example 6.4. Proposition 6.1 states that a random p-biased function is
$(2\sqrt{n}\, 2^{-n/2})$-regular with very high probability. A function is 0-regular if and
only if it is constant (even though you might not think of a constant function as
very “random”). If $A \subseteq \mathbb{F}_2^n$ is an affine subspace of codimension $k$ then $1_A$ is
$2^{-k}$-regular (Proposition 3.12). For $n$ even the inner product mod 2 function and
the complete quadratic function, $\mathrm{IP}_n, \mathrm{CQ}_n : \mathbb{F}_2^n \to \{0,1\}$, are $2^{-n/2-1}$-regular
(Exercise 1.1). On the other hand, the parity functions $\chi_S : \{-1,1\}^n \to \{-1,1\}$
are not $\epsilon$-regular for any $\epsilon < 1$ (except for $S = \emptyset$). By Exercise 5.21, $\mathrm{Maj}_n$ is
$\frac{1}{\sqrt{n}}$-regular.


The notion of regularity can be particularly useful for probability density 
functions; in this case it is traditional to use an alternate name: 


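Definition 6.5. If $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ is a probability density which is $\epsilon$-regular,
we call it an $\epsilon$-biased density. Equivalently, $\varphi$ is $\epsilon$-biased if and only if
$|\mathbf{E}_{x \sim \varphi}[\chi_\gamma(x)]| \le \epsilon$ for all $\gamma \in \widehat{\mathbb{F}_2^n} \setminus \{0\}$; thus one can think of “$\epsilon$-biased” as
meaning “at most $\epsilon$-biased on subspaces”. Note that the marginal of such
a distribution on any set of coordinates $J \subseteq [n]$ is also $\epsilon$-biased. If $\varphi$ is
$\varphi_A = 1_A / \mathbf{E}[1_A]$ for some $A \subseteq \mathbb{F}_2^n$ we call $A$ an $\epsilon$-biased set.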


Example 6.6. For $\varphi$ a probability density we have $\|\varphi\|_1 = \mathbf{E}[\varphi] = 1$, so every
density is 1-biased. The density corresponding to the uniform distribution
on $\mathbb{F}_2^n$, namely $\varphi \equiv 1$, is the only 0-biased density. Densities corresponding to
the uniform distribution on smaller affine subspaces are “maximally biased”: if
$A \subseteq \mathbb{F}_2^n$ is an affine subspace of codimension less than $n$, then $\varphi_A$ is not $\epsilon$-biased
for any $\epsilon < 1$ (Proposition 3.12 again). If $E = \{(0, \ldots, 0), (1, \ldots, 1)\}$, then $E$
is a 1/2-biased set (an easy computation, see also Exercise 1.1(h)).
There is a “combinatorial” property of functions $f$ that is roughly equivalent
to $\epsilon$-regularity. Recall from Exercise 1.29 that $\|\hat{f}\|_4^4$ has an equivalent
non-Fourier formula: $\mathbf{E}_{x,y,z}[f(x) f(y) f(z) f(x + y + z)]$. We show (roughly
speaking) that $f$ is regular if and only if this expectation is not much bigger
than $\mathbf{E}[f]^4 = \mathbf{E}_{x,y,z,w}[f(x) f(y) f(z) f(w)]$:


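Proposition 6.7. Let $f : \mathbb{F}_2^n \to \mathbb{R}$. Then
(1) If $f$ is $\epsilon$-regular, then $\|\hat{f}\|_4^4 - \mathbf{E}[f]^4 \le \epsilon^2 \cdot \mathbf{Var}[f]$.
(2) If $f$ is not $\epsilon$-regular, then $\|\hat{f}\|_4^4 - \mathbf{E}[f]^4 \ge \epsilon^4$.

Proof. If $f$ is $\epsilon$-regular, then

\[ \|\hat{f}\|_4^4 - \mathbf{E}[f]^4 = \sum_{S \ne \emptyset} \hat{f}(S)^4 \le \max_{S \ne \emptyset}\{\hat{f}(S)^2\} \cdot \sum_{S \ne \emptyset} \hat{f}(S)^2 \le \epsilon^2 \cdot \mathbf{Var}[f]. \]

On the other hand, if $f$ is not $\epsilon$-regular, then $|\hat{f}(T)| \ge \epsilon$ for some $T \ne \emptyset$; hence
$\|\hat{f}\|_4^4$ is at least $\hat{f}(\emptyset)^4 + \hat{f}(T)^4 \ge \mathbf{E}[f]^4 + \epsilon^4$.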


The condition of $\epsilon$-regularity (that all non-empty-set coefficients are
small) is quite strong. As we saw when investigating the $\frac{2}{\pi}$ Theorem in
Chapter 5.4 it’s also interesting to consider $f$ that merely have $|\hat{f}(i)| \le \epsilon$
for all $i \in [n]$; for monotone $f$ this is the same as saying $\mathbf{Inf}_i[f] \le \epsilon$ for all $i$.
This suggests two weaker possible notions of pseudorandomness: having all
low-degree Fourier coefficients small, and having all influences small. We will
consider both possibilities, starting with the second.

Now a randomly chosen $f : \{-1,1\}^n \to \{-1,1\}$ will not have all of its
influences small; in fact as we saw in Exercise 2.12, each $\mathbf{Inf}_i[f]$ is 1/2
in expectation. However, for any $\delta > 0$ it will have all of its $(1 - \delta)$-stable
influences exponentially small (recall Definition 2.52). In Exercise 6.2 you will
show:


Fact 6.8. Fix $\delta \in [0,1]$ and let $f : \{-1,1\}^n \to \{-1,1\}$ be a randomly chosen
function. Then for any $i \in [n]$,

\[ \mathbf{E}\bigl[\mathbf{Inf}_i^{(1-\delta)}[f]\bigr] = \frac{(1 - \delta/2)^n}{2 - \delta}. \]


This motivates a very important notion of pseudorandomness in the analysis
of Boolean functions: having all stable-influences small. Recalling the
discussion surrounding Proposition 2.54, we can also describe this as having
no “notable” coordinates.


Definition 6.9. We say that $f : \{-1,1\}^n \to \mathbb{R}$ has $(\epsilon, \delta)$-small stable influences,
or no $(\epsilon, \delta)$-notable coordinates, if $\mathbf{Inf}_i^{(1-\delta)}[f] \le \epsilon$ for each $i \in [n]$. This
condition gets stronger as $\epsilon$ and $\delta$ decrease: when $\delta = 0$, meaning $\mathbf{Inf}_i[f] \le \epsilon$
for all $i$, we simply say $f$ has $\epsilon$-small influences.


Example 6.10. Besides random functions, important examples of Boolean-valued
functions with no notable coordinates are constants, majority, and large
parities. Constant functions are the ultimate in this regard: they have (0, 0)-small
stable influences. (Indeed, constant functions are the only ones with 0-small
influences.) The $\mathrm{Maj}_n$ function has $\frac{1}{\sqrt{n}}$-small influences. To see the distinction
between influences and stable influences, consider the parity functions $\chi_S$.
Any parity function $\chi_S$ (with $S \ne \emptyset$) has at least one coordinate with maximal
influence, 1. But if $|S|$ is “large” then all of its stable influences will be small:
We have $\mathbf{Inf}_i^{(1-\delta)}[\chi_S]$ equal to $(1 - \delta)^{|S|-1}$ when $i \in S$ and equal to 0 otherwise;
i.e., $\chi_S$ has $((1 - \delta)^{|S|-1}, \delta)$-small stable influences. In particular, $\chi_S$ has
$(\epsilon, \delta)$-small stable influences whenever $|S| \ge \frac{\ln(e/\epsilon)}{\delta}$.

The prototypical example of a function $f : \{-1,1\}^n \to \{-1,1\}$ that does
not have small stable influences is an unbiased k-junta. Such a function has
$\mathbf{Var}[f] = 1$ and hence from Fact 2.53 the sum of its $(1 - \delta)$-stable influences
is at least $(1 - \delta)^{k-1}$. Thus $\mathbf{Inf}_i^{(1-\delta)}[f] \ge (1 - \delta)^{k-1}/k$ for at least one $i$;
hence $f$ does not have $((1 - \delta)^k / k, \delta)$-small stable influences for any $\delta \in (0, 1)$.
A somewhat different example is the function $f(x) = x_0 \mathrm{Maj}_n(x_1, \ldots, x_n)$,
which has $\mathbf{Inf}_0^{(1-\delta)}[f] \ge 1 - \sqrt{\delta}$; see Exercise 6.5(d).
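The parity and majority claims in Example 6.10 can be checked numerically for small n. The sketch below assumes the formula $\mathbf{Inf}_i^{(\rho)}[f] = \sum_{S \ni i} \rho^{|S|-1} \hat{f}(S)^2$ for stable influences; the function names and the bitmask encoding are illustrative choices, not notation from the text.

```python
def fourier_coefficients(values):
    """Fourier coefficients via the fast Walsh-Hadamard transform.

    values[x] is f at the point whose i-th coordinate is (-1) ** ((x >> i) & 1).
    """
    coeffs = list(values)
    h = 1
    while h < len(coeffs):
        for start in range(0, len(coeffs), 2 * h):
            for j in range(start, start + h):
                u, v = coeffs[j], coeffs[j + h]
                coeffs[j], coeffs[j + h] = u + v, u - v
        h *= 2
    return [c / len(coeffs) for c in coeffs]


def stable_influence(coeffs, i, rho):
    """Sum of rho^(|S|-1) * fhat(S)^2 over the sets S (bitmasks) that contain i."""
    total = 0.0
    for S, c in enumerate(coeffs):
        if (S >> i) & 1:
            total += rho ** (bin(S).count("1") - 1) * c * c
    return total


def parity_table(n, mask):
    """Truth table of the parity function on the coordinates selected by mask."""
    return [(-1) ** bin(x & mask).count("1") for x in range(2 ** n)]


def majority_table(n):
    """Truth table of Maj_n for odd n; the coordinate sum is n - 2 * popcount(x)."""
    return [1 if n - 2 * bin(x).count("1") > 0 else -1 for x in range(2 ** n)]


delta = 0.5
chi = fourier_coefficients(parity_table(6, 0b111111))   # parity on all 6 coordinates
maj = fourier_coefficients(majority_table(5))
print(stable_influence(chi, 0, 1.0))         # ordinary influence (delta = 0)
print(stable_influence(chi, 0, 1 - delta))   # compare with (1 - delta) ** 5
print(stable_influence(maj, 0, 1.0))         # compare with 1 / sqrt(5)
```

Setting `rho = 1.0` recovers the ordinary influence, which makes the contrast in the example visible: the parity coordinate has influence 1 but a small stable influence.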


Let’s return to considering the interesting condition that $|\hat{f}(i)| \le \epsilon$ for all
$i \in [n]$. We will call this condition $(\epsilon, 1)$-regularity. It is equivalent to saying
that $f^{\le 1}$ is $\epsilon$-regular, or that $f$ has at most $\epsilon$ “correlation” with every dictator:
$|\langle f, \pm\chi_i \rangle| \le \epsilon$ for all $i$. Our third notion of pseudorandomness extends this
condition to higher degrees:


Definition 6.11. A function $f : \{-1,1\}^n \to \mathbb{R}$ is $(\epsilon, k)$-regular if $|\hat{f}(S)| \le \epsilon$
for all $0 < |S| \le k$; equivalently, if $f^{\le k}$ is $\epsilon$-regular. For $k = n$ (or $k = \infty$), this
condition coincides with $\epsilon$-regularity. When $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ is an $(\epsilon, k)$-regular
probability density, it is more usual to call $\varphi$ (and the associated probability
distribution) $(\epsilon, k)$-wise independent.


Below we give two alternate characterizations of (€, k)-regularity; however, 
they are fairly “rough” in the sense that they have exponential losses on k. This 
can be acceptable if k is thought of as a constant. The first characterization is 
that $f$ is $(\epsilon, k)$-regular if and only if fixing $k$ input coordinates changes $f$’s
mean by at most $O(\epsilon)$. The second characterization is the condition that $f$ has
$O(\epsilon)$ covariance with every k-junta.
Proposition 6.12. Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $\epsilon \ge 0$, $k \in \mathbb{N}$.

(1) If $f$ is $(\epsilon, k)$-regular then any restriction of at most $k$ coordinates
changes $f$’s mean by at most $2^k \epsilon$.

(2) If $f$ is not $(\epsilon, k)$-regular then some restriction to at most $k$ coordinates
changes $f$’s mean by more than $\epsilon$.


Proposition 6.13. Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $\epsilon \ge 0$, $k \in \mathbb{N}$.

(1) If $f$ is $(\epsilon, k)$-regular, then $\mathbf{Cov}[f, h] \le \|\hat{h}\|_1 \epsilon$ for any $h : \{-1,1\}^n \to \mathbb{R}$
with $\deg(h) \le k$. In particular, $\mathbf{Cov}[f, h] \le 2^{k/2} \epsilon$ for any k-junta
$h : \{-1,1\}^n \to \{-1,1\}$.

(2) If $f$ is not $(\epsilon, k)$-regular, then $\mathbf{Cov}[f, h] > \epsilon$ for some k-junta
$h : \{-1,1\}^n \to \{-1,1\}$.

We will prove Proposition 6.12, leaving the proof of Proposition 6.13 to the
exercises.


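Proof of Proposition 6.12. For the first statement, suppose $f$ is $(\epsilon, k)$-regular
and let $J \subseteq [n]$, $z \in \{-1,1\}^J$, where $|J| \le k$. Then the statement holds because

\[ \mathbf{E}[f_{\overline{J} \mid z}] = \hat{f}(\emptyset) + \sum_{\emptyset \ne T \subseteq J} \hat{f}(T)\, z^T \]

(Exercise 1.15) and each of the at most $2^k$ terms $|\hat{f}(T)\, z^T| = |\hat{f}(T)|$ is at
most $\epsilon$.

For the second statement, suppose that $|\hat{f}(J)| > \epsilon$, where $0 < |J| \le k$.
Then a given restriction $z \in \{-1,1\}^J$ changes $f$’s mean by

\[ h(z) = \sum_{\emptyset \ne T \subseteq J} \hat{f}(T)\, z^T. \]

We need to show that $\|h\|_\infty > \epsilon$, and this follows from

\[ \|h\|_\infty = \|h \chi_J\|_\infty \ge |\mathbf{E}[h \chi_J]| = |\hat{h}(J)| = |\hat{f}(J)| > \epsilon. \]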


Taking € = 0 in the above two propositions we obtain: 


Corollary 6.14. For $f : \{-1,1\}^n \to \mathbb{R}$, the following are equivalent:
(1) $f$ is $(0, k)$-regular.
(2) Every restriction of at most $k$ coordinates leaves $f$’s mean unchanged.
(3) $\mathbf{Cov}[f, h] = 0$ for every k-junta $h : \{-1,1\}^n \to \{-1,1\}$.

If $f$ is a probability density, condition (3) is equivalent to $\mathbf{E}_{x \sim f}[h(x)] = \mathbf{E}[h]$
for every k-junta $h : \{-1,1\}^n \to \{-1,1\}$.


For such functions, additional terminology is used: 
[Figure 6.1: diagram with the nodes “small influences”, “regular”, “low-degree regular”,
and “small stable influences (no notable coordinates)”, connected by arrows.]

Figure 6.1. Comparing notions of pseudorandomness: arrows go from
stronger notions to (strictly) weaker ones


Definition 6.15. If $f : \{-1,1\}^n \to \{-1,1\}$ is $(0, k)$-regular, it is also called
kth-order correlation immune. If $f$ is in addition unbiased, then it is called
k-resilient. Finally, if $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ is a $(0, k)$-regular probability density, then
we call $\varphi$ (and the associated probability distribution) k-wise independent.


Example 6.16. Any parity function $\chi_S : \{-1,1\}^n \to \{-1,1\}$ with $|S| = k + 1$
is k-resilient. More generally, so is $\chi_S \cdot g$ for any $g : \{-1,1\}^n \to \{-1,1\}$ that
does not depend on the coordinates in $S$. For a good example of a correlation
immune function that is not resilient, consider $h : \{-1,1\}^{3m} \to \{-1,1\}$ defined
by $h = \chi_{\{1,\ldots,2m\}} \wedge \chi_{\{m+1,\ldots,3m\}}$. This $h$ is not unbiased, being True on only a
1/4-fraction of inputs. However, its bias does not change unless at least $2m$
input bits are fixed; hence $h$ is $(2m - 1)$th-order correlation immune.
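The function h from Example 6.16 is small enough to check by brute force. The following sketch computes the largest k for which all Fourier coefficients on sets of size 1 through k vanish; the names `correlation_immunity_order` and `example_h` are hypothetical, and True is encoded as −1 as in the text.

```python
def walsh_hadamard(values):
    """Unnormalized transform: entry S equals 2^n * fhat(S), kept as exact integers."""
    coeffs = list(values)
    h = 1
    while h < len(coeffs):
        for start in range(0, len(coeffs), 2 * h):
            for j in range(start, start + h):
                u, v = coeffs[j], coeffs[j + h]
                coeffs[j], coeffs[j + h] = u + v, u - v
        h *= 2
    return coeffs


def correlation_immunity_order(values):
    """Largest k such that fhat(S) = 0 whenever 0 < |S| <= k (n for constants)."""
    coeffs = walsh_hadamard(values)
    n = len(values).bit_length() - 1
    sizes = [bin(S).count("1") for S in range(1, len(coeffs)) if coeffs[S] != 0]
    return min(sizes, default=n + 1) - 1


def is_resilient(values, k):
    """k-resilient means unbiased and kth-order correlation immune."""
    return sum(values) == 0 and correlation_immunity_order(values) >= k


def example_h(m):
    """Truth table of chi_{1..2m} AND chi_{m+1..3m} on 3m bits, True encoded as -1."""
    first = (1 << (2 * m)) - 1             # coordinates 1..2m
    second = ((1 << (2 * m)) - 1) << m     # coordinates m+1..3m
    table = []
    for x in range(2 ** (3 * m)):
        both_true = (bin(x & first).count("1") % 2 == 1
                     and bin(x & second).count("1") % 2 == 1)
        table.append(-1 if both_true else 1)
    return table


h_table = example_h(2)
parity4 = [(-1) ** bin(x & 0b1111).count("1") for x in range(2 ** 6)]
print(correlation_immunity_order(h_table), is_resilient(h_table, 3))
print(correlation_immunity_order(parity4), is_resilient(parity4, 3))
```

With m = 2 the first line should report order 2m − 1 = 3 but no resilience, while the parity on four coordinates is 3-resilient, matching the two halves of the example.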


We conclude this section with Figure 6.1, indicating how our various 
notions of pseudorandomness compare. For precise quantitative statements, 
counterexamples showing that no other relationships are possible, and expla- 
nations for why these notions essentially coincide for monotone functions, see 
Exercise 6.5. 


6.2. F2-Polynomials


We began our study of Boolean functions in Chapter 1.2 by considering their 
polynomial representations over the real field. In this section we take a brief 
look at their polynomial representations over the field $\mathbb{F}_2$, with False, True
being represented by $0, 1 \in \mathbb{F}_2$ as usual. Note that in the field $\mathbb{F}_2$, the arithmetic
operations $+$ and $\cdot$ correspond to logical XOR and logical AND, respectively.


Example 6.17. Consider the logical parity (XOR) function on $n$ bits, $\chi_{[n]}$.
To represent it over the reals (as we have done so far) we encode False, True
by $\pm 1 \in \mathbb{R}$; then $\chi_{[n]} : \{-1,1\}^n \to \{-1,1\}$ has the polynomial representation
$\chi_{[n]}(x) = x_1 x_2 \cdots x_n$. Suppose instead we encode False, True by $0, 1 \in \mathbb{F}_2$; then
$\chi_{[n]} : \mathbb{F}_2^n \to \mathbb{F}_2$ has the polynomial representation $\chi_{[n]}(x) = x_1 + x_2 + \cdots + x_n$.
Notice this polynomial has degree 1, whereas the representation over the reals
has degree $n$.


In general, let $f : \mathbb{F}_2^n \to \mathbb{F}_2$ be any Boolean function. Just as in Chapter 1.2
we can find a (multilinear) polynomial representation for it by interpolation.
The indicator function $1_{\{a\}} : \mathbb{F}_2^n \to \mathbb{F}_2$ for $a \in \mathbb{F}_2^n$ can be written as

\[ 1_{\{a\}}(x) = \prod_{i : a_i = 1} x_i \prod_{i : a_i = 0} (1 - x_i), \tag{6.1} \]

a degree-n multilinear polynomial. (We could have written $1 + x_i$ rather than
$1 - x_i$ since these are the same in $\mathbb{F}_2$.) Hence $f$ has the multilinear polynomial
expression

\[ f(x) = \sum_{a \in \mathbb{F}_2^n} f(a)\, 1_{\{a\}}(x). \tag{6.2} \]

After simplification, this may be put in the form

\[ f(x) = \sum_{S \subseteq [n]} c_S\, x^S, \tag{6.3} \]

where $x^S = \prod_{i \in S} x_i$ as usual, and each coefficient $c_S$ is in $\mathbb{F}_2$. We call (6.3)
the $\mathbb{F}_2$-polynomial representation of $f$. As an example, if $f = \chi_{[3]}$ is the parity
function on 3 bits, its interpolation is

\[
\begin{aligned}
\chi_{[3]}(x) &= (1 - x_1)(1 - x_2)x_3 + (1 - x_1)x_2(1 - x_3) + x_1(1 - x_2)(1 - x_3) + x_1 x_2 x_3 \\
&= x_1 + x_2 + x_3 - 2(x_1 x_2 + x_1 x_3 + x_2 x_3) + 4 x_1 x_2 x_3 \qquad (6.4) \\
&= x_1 + x_2 + x_3
\end{aligned}
\]

as expected. We also have uniqueness of the $\mathbb{F}_2$-polynomial representation; the
quickest way to see this is to note that there are $2^{2^n}$ functions $\mathbb{F}_2^n \to \mathbb{F}_2$ and also
$2^{2^n}$ possible choices for the coefficients $c_S$. Summarizing:


Proposition 6.18. Every $f : \mathbb{F}_2^n \to \mathbb{F}_2$ has a unique $\mathbb{F}_2$-polynomial representation
as in (6.3).


Example 6.19. The logical AND function $\mathrm{AND}_n : \mathbb{F}_2^n \to \mathbb{F}_2$ has the simple
expansion $\mathrm{AND}_n(x) = x_1 x_2 \cdots x_n$. The inner product mod 2 function has the
degree-2 expansion $\mathrm{IP}_{2n}(x_1, \ldots, x_n, y_1, \ldots, y_n) = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$.


Since the $\mathbb{F}_2$-polynomial representation is unique we may define $\mathbb{F}_2$-degree:

Definition 6.20. The $\mathbb{F}_2$-degree of a Boolean function $f : \{\text{False}, \text{True}\}^n \to \{\text{False}, \text{True}\}$,
denoted $\deg_{\mathbb{F}_2}(f)$, is the degree of its $\mathbb{F}_2$-polynomial representation.
We reserve the notation $\deg(f)$ for the degree of $f$’s Fourier expansion.


We can also give a formula for the coefficients of the F2-polynomial repre- 
sentation: 


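Proposition 6.21. Suppose $f : \mathbb{F}_2^n \to \mathbb{F}_2$ has $\mathbb{F}_2$-polynomial representation
$f(x) = \sum_{S \subseteq [n]} c_S x^S$. Then $c_S = \sum_{\mathrm{supp}(x) \subseteq S} f(x)$.


Corollary 6.22. Let $f : \{\text{False}, \text{True}\}^n \to \{\text{False}, \text{True}\}$. Then $\deg_{\mathbb{F}_2}(f) = n$ if
and only if $f(x) = \text{True}$ for an odd number of inputs $x$.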
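The coefficient formula of Proposition 6.21 translates directly into a subset-sum transform over $\mathbb{F}_2$. The sketch below is one way to compute it for small n; the names `f2_coefficients` and `f2_degree` and the bitmask encoding of inputs and sets are assumptions of this illustration.

```python
from itertools import product


def f2_coefficients(values):
    """Return c with c[S] = sum of f(x) over supp(x) contained in S, taken mod 2.

    values[x] is f(x) in {0, 1}; x and S are bitmasks over the n coordinates.
    """
    coeffs = [v & 1 for v in values]
    n = len(coeffs).bit_length() - 1
    for i in range(n):
        for S in range(len(coeffs)):
            if (S >> i) & 1:
                # fold in the subsets of S that omit coordinate i
                coeffs[S] ^= coeffs[S ^ (1 << i)]
    return coeffs


def f2_degree(values):
    """Degree of the F_2-polynomial representation (0 for the zero polynomial)."""
    coeffs = f2_coefficients(values)
    return max((bin(S).count("1") for S, c in enumerate(coeffs) if c), default=0)


parity3 = [bin(x).count("1") & 1 for x in range(8)]
and3 = [1 if x == 0b111 else 0 for x in range(8)]
print(f2_coefficients(parity3))   # nonzero only on the three singletons
print(f2_coefficients(and3))      # nonzero only on the full set

# Corollary 6.22 checked exhaustively for n = 3.
corollary_holds = all(
    (f2_degree(list(table)) == 3) == (sum(table) % 2 == 1)
    for table in product((0, 1), repeat=8)
)
print(corollary_holds)
```

The transform touches each of the $2^n$ table entries once per coordinate, so it costs about $n 2^n$ bit operations rather than the $3^n$ of the naive double sum.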


The proof of Proposition 6.21 is left for Exercise 6.10; Corollary 6.22 is just
the case $S = [n]$. You can also directly see that $c_{[n]} = \sum_x f(x)$ by observing
what happens with the monomial $x_1 x_2 \cdots x_n$ in the interpolation (6.1), (6.2).

Given a generic Boolean function $f : \{\text{False}, \text{True}\}^n \to \{\text{False}, \text{True}\}$ it’s
natural to ask about the relationship between its Fourier expansion (i.e.,
polynomial representation over $\mathbb{R}$) and its $\mathbb{F}_2$-polynomial representation. In
fact you can easily derive the $\mathbb{F}_2$-representation from the $\mathbb{R}$-representation.
Suppose $p(x)$ is the Fourier expansion of $f$; i.e., $f$’s $\mathbb{R}$-multilinear representation
when we interpret False, True as $\pm 1 \in \mathbb{R}$. From Exercise 1.9,
$q(x) = \frac{1}{2} - \frac{1}{2} p(1 - 2x_1, \ldots, 1 - 2x_n)$ is the unique $\mathbb{R}$-multilinear representation
for $f$ when we interpret False, True as $0, 1 \in \mathbb{R}$. But we can also obtain
$q(x)$ by carrying out the interpolation in (6.1), (6.2) over $\mathbb{Z}$. Thus the $\mathbb{F}_2$-representation
of $f$ is obtained simply by reducing $q(x)$’s (integer) coefficients
modulo 2.

We saw an example of this derivation above with $\chi_{[3]}$. The $\pm 1$-representation
is $x_1 x_2 x_3$. The representation over $\{0, 1\} \subseteq \mathbb{Z} \subseteq \mathbb{R}$ is
$\frac{1}{2} - \frac{1}{2}(1 - 2x_1)(1 - 2x_2)(1 - 2x_3)$, which when expanded equals (6.4) and has integer coefficients.
Finally, we obtain the $\mathbb{F}_2$-representation $x_1 + x_2 + x_3$ by reducing the coefficients
of (6.4) modulo 2.

One thing to note about this transformation from Fourier expansion to
$\mathbb{F}_2$-representation is that it can only decrease degree. As noted in Exercise 1.11,
the first step, forming $q(x) = \frac{1}{2} - \frac{1}{2} p(1 - 2x_1, \ldots, 1 - 2x_n)$, does not change
the degree at all (except if $p(x) \equiv 1$, $q(x) \equiv 0$). And the second step, reducing
$q$’s coefficients modulo 2, cannot increase the degree. We conclude:

Proposition 6.23. Let $f : \{-1,1\}^n \to \{-1,1\}$. Then $\deg_{\mathbb{F}_2}(f) \le \deg(f)$.


Here is an interesting consequence of this proposition. Suppose
$f : \{-1,1\}^n \to \{-1,1\}$ is k-resilient; i.e., $\hat{f}(S) = 0$ for all $|S| \le k < n$.
Let $g = \chi_{[n]} \cdot f$; thus $\hat{g}(S) = \hat{f}([n] \setminus S)$ and hence $\deg(g) \le n - k - 1$.
From Proposition 6.23 we deduce $\deg_{\mathbb{F}_2}(g) \le n - k - 1$. But if we interpret
$f, g : \mathbb{F}_2^n \to \mathbb{F}_2$, then $g = x_1 + \cdots + x_n + f$ and hence $\deg_{\mathbb{F}_2}(g) = \deg_{\mathbb{F}_2}(f)$
(unless $f$ is parity or its negation). Thus:


Proposition 6.24. Let $f : \{-1,1\}^n \to \{-1,1\}$ be k-resilient, $k < n - 1$. Then
$\deg_{\mathbb{F}_2}(f) \le n - k - 1$.


This proposition was shown by Siegenthaler, a cryptographer who was 
studying stream ciphers; his motivation is discussed further in the notes in 
Section 6.6. More generally, Siegenthaler proved the following result (the proof 
does not require Fourier analysis): 


Siegenthaler’s Theorem. Proposition 6.24 holds. Further, if $f$ is merely kth-order
correlation immune, then we still have $\deg_{\mathbb{F}_2}(f) \le n - k$ (for $k < n$).


Proof. Pick a monomial $x^J$ of maximal degree $d = \deg_{\mathbb{F}_2}(f)$ in $f$’s
$\mathbb{F}_2$-polynomial representation; we may assume $d > 1$ else we are done. Make
an arbitrary restriction to the $n - d$ coordinates outside of $J$, forming function
$g : \mathbb{F}_2^J \to \mathbb{F}_2$. The monomial $x^J$ still appears in $g$’s $\mathbb{F}_2$-polynomial representation;
thus by Corollary 6.22, $g$ is 1 for an odd number of inputs.

Let us first show Proposition 6.24. Assuming $f$ is k-resilient, it is unbiased.
But $g$ is 1 for an odd number of inputs so it cannot be unbiased (since $2^{d-1}$
is even for $d > 1$). Thus the restriction changed $f$’s bias, and we must have
$n - d > k$, hence $d \le n - k - 1$.

Suppose now $f$ is merely kth-order correlation immune. Pick an arbitrary
input coordinate for $g$ and suppose its two possible restrictions give subfunctions
$g_0$ and $g_1$. Since $g$ has an odd number of 1’s, one of $g_0$, $g_1$ has an odd number
of 1’s and the other has an even number. In particular, $g_0$ and $g_1$ have different
biases. One of these biases must differ from $f$’s. Thus $n - d + 1 > k$, hence
$d \le n - k$.


We end this section by mentioning another bound related to correlation 
immunity: 


Theorem 6.25. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is kth-order correlation
immune but not k-resilient (i.e., $\mathbf{E}[f] \ne 0$). Then $k + 1 \le \frac{2}{3} n$.


The proof of this theorem (left to Exercise 6.14) uses the Fourier expansion
rather than the $\mathbb{F}_2$-representation. The bounds in both Siegenthaler’s Theorem
and Theorem 6.25 can be sharp in many cases; see Exercise 6.15.
6.3. Constructions of Various Pseudorandom Functions


In this section we give some constructions of Boolean functions with strong 
pseudorandomness properties. We begin by discussing bent functions: 


Definition 6.26. A function $f : \mathbb{F}_2^n \to \{-1, 1\}$ (with $n$ even) is called bent if
$|\hat{f}(\gamma)| = 2^{-n/2}$ for all $\gamma \in \widehat{\mathbb{F}_2^n}$.


Bent functions are $2^{-n/2}$-regular. If the definition of $\epsilon$-regularity were
changed so that even $|\hat{f}(0)|$ needed to be at most $\epsilon$, then bent functions would
be the most regular possible functions. This is because $\sum_\gamma \hat{f}(\gamma)^2 = 1$ for any
$f : \mathbb{F}_2^n \to \{-1, 1\}$ and hence at least one $|\hat{f}(\gamma)|$ must be at least $2^{-n/2}$. In particular,
bent functions are those that are maximally distant from the class of
affine functions, $\{\pm\chi_\gamma : \gamma \in \widehat{\mathbb{F}_2^n}\}$.

We have encountered some bent functions already. The canonical example
is the inner product mod 2 function, $\mathrm{IP}_n(x) = \chi(x_1 x_{n/2+1} + x_2 x_{n/2+2} + \cdots + x_{n/2} x_n)$.
(Recall the notation $\chi(b) = (-1)^b$.) For $n = 2$ this is just the $\mathrm{AND}_2$
function $\frac{1}{2} + \frac{1}{2} x_1 + \frac{1}{2} x_2 - \frac{1}{2} x_1 x_2$, which is bent by inspection. For general $n$,
the bentness is a consequence of the following fact (proved in Exercise 6.16):


Proposition 6.27. Let $f : \mathbb{F}_2^n \to \{-1, 1\}$ and $g : \mathbb{F}_2^{n'} \to \{-1, 1\}$ be bent. Then
$f \oplus g : \mathbb{F}_2^{n+n'} \to \{-1, 1\}$ defined by $(f \oplus g)(x, x') = f(x) g(x')$ is also bent.

Another example of a bent function is the complete quadratic function
$\mathrm{CQ}_n(x) = \chi\bigl(\sum_{1 \le i < j \le n} x_i x_j\bigr)$ from Exercise 1.1. Actually, in some sense it is
the “same” example, as we now explain.


Proposition 6.28. Let $f : \mathbb{F}_2^n \to \{-1, 1\}$ be bent. Then $\pm\chi_\gamma \cdot f$ is bent for any
$\gamma \in \widehat{\mathbb{F}_2^n}$, as is $f \circ M$ for any invertible linear transformation $M : \mathbb{F}_2^n \to \mathbb{F}_2^n$.


Proof. Multiplying by $-1$ does not change bentness, and both $\chi_\gamma \cdot f$ and $f \circ M$
have the same Fourier coefficients as $f$ up to a permutation (see Exercise 3.1).


We claim that $\mathrm{CQ}_n$ arises from $f = \mathrm{IP}_n$ as in Proposition 6.28. In the
case $n = 4$, this is because
$\sum_{1 \le i < j \le 4} x_i x_j = (x_1 + x_3)(x_2 + x_3) + (x_1 + x_2 + x_3) x_4 + x_3$
over $\mathbb{F}_2$; thus


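\[ \mathrm{CQ}_4(x) = \mathrm{IP}_4(Mx) \cdot \chi_{(0,0,1,0)}(x), \quad \text{where } M = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \]

is invertible.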
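The n = 4 case can be confirmed by enumerating all 16 inputs. In the sketch below, the matrix `M` is one invertible choice read off from the factorization $(x_1 + x_3)(x_2 + x_3) + (x_1 + x_2 + x_3)x_4 + x_3$; it is an assumption of this example, as are the helper names.

```python
from itertools import product

# Rows give y = Mx with y1 = x1 + x3, y2 = x1 + x2 + x3, y3 = x2 + x3, y4 = x4,
# so that y1*y3 + y2*y4 reproduces the factorization of the complete quadratic form.
M = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]


def chi(b):
    """chi(b) = (-1)^b for an integer b, read modulo 2."""
    return -1 if b % 2 else 1


def ip4(y):
    """Inner product mod 2 on 4 bits: chi(y1*y3 + y2*y4)."""
    return chi(y[0] * y[2] + y[1] * y[3])


def cq4(x):
    """Complete quadratic function on 4 bits."""
    return chi(sum(x[i] * x[j] for i in range(4) for j in range(i + 1, 4)))


def apply_matrix(matrix, x):
    """Matrix-vector product over F_2."""
    return tuple(sum(row[c] * x[c] for c in range(len(x))) % 2 for row in matrix)


def is_bent(func, n):
    """Check that every Fourier coefficient has magnitude 2^(-n/2), for even n."""
    points = list(product((0, 1), repeat=n))
    for gamma in points:
        total = sum(func(x) * chi(sum(g * xi for g, xi in zip(gamma, x)))
                    for x in points)          # equals 2^n * fhat(gamma)
        if abs(total) != 2 ** (n // 2):
            return False
    return True


cube = list(product((0, 1), repeat=4))
identity_holds = all(cq4(x) == ip4(apply_matrix(M, x)) * chi(x[2]) for x in cube)
m_is_invertible = len({apply_matrix(M, x) for x in cube}) == 16
print(identity_holds, m_is_invertible, is_bent(ip4, 4), is_bent(cq4, 4))
```

Checking invertibility by testing that the map is a bijection on the 16 points avoids any linear algebra and is enough at this size.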


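The general case is left to Exercise 6.20. In fact, every bent $f$ with $\deg_{\mathbb{F}_2}(f) \le 2$
arises by applying Proposition 6.28 to the inner product mod 2 function; see
Exercise 6.19. There are other large families of bent functions; however, the
problem of classifying all bent functions is open and seems difficult. We content
ourselves by describing one more family:


Proposition 6.29. Let $f : \mathbb{F}_2^{2n} \to \{-1, 1\}$ be defined by $f(x, y) = \mathrm{IP}_{2n}(x, y)\, g(y)$
where $g : \{-1, 1\}^n \to \{-1, 1\}$ is arbitrary. Then $f$ is bent.


Proof. We will think of $y \in \widehat{\mathbb{F}_2^n}$, so $\mathrm{IP}_{2n}(x, y) = \chi_y(x)$. We’ll also write a
generic $\gamma \in \widehat{\mathbb{F}_2^{2n}}$ as $(\gamma_1, \gamma_2)$. Then indeed

\[
\begin{aligned}
\hat{f}(\gamma) &= \mathbf{E}_{x,y}\bigl[g(y)\, \chi_y(x)\, \chi_\gamma(x, y)\bigr]
= \mathbf{E}_{y}\Bigl[g(y)\, \chi_{\gamma_2}(y)\, \mathbf{E}_{x}\bigl[\chi_{y + \gamma_1}(x)\bigr]\Bigr] \\
&= \mathbf{E}_{y}\bigl[g(y)\, \chi_{\gamma_2}(y)\, 1_{\{y = \gamma_1\}}\bigr]
= 2^{-n} g(\gamma_1)\, \chi_{\gamma_2}(\gamma_1) = \pm 2^{-n}.
\end{aligned}
\]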


We next discuss explicit constructions of small e-biased sets, which are 
of considerable use in the field of algorithmic derandomization. The most 
basic step in a randomized algorithm is drawing a string x ~ F} from the 
uniform distribution; however, this has the “cost” of generating n independent, 
random bits. But sometimes it’s not necessary that x precisely have the uniform 
distribution; it may suffice that x be drawn from an €-biased density. If we can 
deterministically find an $\epsilon$-biased (multi-)set $A$ of cardinality, say, $2^\ell$, then we
can generate $x \sim \varphi_A$ using just $\ell$ independent random bits. We will see some
example derandomizations of this nature in Section 6.4; for now we discuss
constructions.

Fix $\ell \in \mathbb{N}^+$ and recall that there exists a finite field $\mathbb{F}_{2^\ell}$ with exactly $2^\ell$
elements. It is easy to find an explicit representation for $\mathbb{F}_{2^\ell}$ (a complete
addition and multiplication table, say) in time $2^{O(\ell)}$. (In fact, one can compute
within $\mathbb{F}_{2^\ell}$ even in deterministic $\mathrm{poly}(\ell)$ time.) The field elements $x \in \mathbb{F}_{2^\ell}$ are
naturally encoded by distinct $\ell$-bit vectors; we will write $\mathrm{enc} : \mathbb{F}_{2^\ell} \to \mathbb{F}_2^\ell$ for
this encoding. The encoding is linear; i.e., it satisfies $\mathrm{enc}(0) = (0, \ldots, 0)$ and
$\mathrm{enc}(x + y) = \mathrm{enc}(x) + \mathrm{enc}(y)$ for all $x, y \in \mathbb{F}_{2^\ell}$.


Theorem 6.30. There is a deterministic algorithm that, given $n \ge 1$ and
$0 < \epsilon \le 1/2$, runs in $\mathrm{poly}(n/\epsilon)$ time and outputs a multiset $A \subseteq \mathbb{F}_2^n$ of cardinality
at most $16(n/\epsilon)^2$ with the property that $\varphi_A$ is an $\epsilon$-biased density.


Proof. It suffices to obtain cardinality $(n/\epsilon)^2$ under the assumption that $\epsilon = 2^{-t}$
and $n = 2^{\ell - t}$ are integer powers of 2. We will describe a probability density $\varphi$ on
$\mathbb{F}_2^n$ by giving a procedure for drawing a string $y \sim \varphi$ which uses $2\ell$ independent
random bits. $A$ will be the multiset of $2^{2\ell} = (n/\epsilon)^2$ possible outcomes for $y$.
It will be clear that $A$ can be generated in deterministic polynomial time. The
goal will be to show that $\varphi$ is $2^{-t}$-biased.
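To draw $y \sim \varphi$, first choose $r, s \sim \mathbb{F}_{2^\ell}$ independently and uniformly. This
uses $2\ell$ independent random bits. Then define the $i$th coordinate of $y$ by

\[ y_i = \langle \mathrm{enc}(r^i), \mathrm{enc}(s) \rangle, \quad i \in [n], \]

where the inner product $\langle \cdot, \cdot \rangle$ takes place in $\mathbb{F}_2^\ell$. Fixing $\gamma \in \mathbb{F}_2^n \setminus \{0\}$, we need
to argue that $|\mathbf{E}[\chi_\gamma(y)]| \le 2^{-t}$. Now over $\mathbb{F}_2$,

\[
\langle \gamma, y \rangle = \sum_{i=1}^{n} \gamma_i \langle \mathrm{enc}(r^i), \mathrm{enc}(s) \rangle
= \Bigl\langle \sum_{i=1}^{n} \gamma_i\, \mathrm{enc}(r^i),\ \mathrm{enc}(s) \Bigr\rangle
= \Bigl\langle \mathrm{enc}\Bigl(\sum_{i=1}^{n} \gamma_i r^i\Bigr),\ \mathrm{enc}(s) \Bigr\rangle,
\]

where the last step used linearity of enc. Thus

\[ \mathbf{E}[\chi_\gamma(y)] = \mathbf{E}\bigl[(-1)^{\langle \gamma, y \rangle}\bigr] = \mathbf{E}_{r}\Bigl[\mathbf{E}_{s}\bigl[(-1)^{\langle \mathrm{enc}(p_\gamma(r)),\, \mathrm{enc}(s) \rangle}\bigr]\Bigr], \tag{6.5} \]

where $p_\gamma : \mathbb{F}_{2^\ell} \to \mathbb{F}_{2^\ell}$ is the polynomial $a \mapsto \gamma_1 a + \gamma_2 a^2 + \cdots + \gamma_n a^n$. This
polynomial is of degree at most $n$, and is nonzero since $\gamma \ne 0$. Hence it has
at most $n$ roots (zeroes) over the field $\mathbb{F}_{2^\ell}$. Whenever $r$ is one of these roots,
$\mathrm{enc}(p_\gamma(r)) = 0$ and the inner expectation in (6.5) is 1. But whenever $r$ is not a
root of $p_\gamma$ we have $\mathrm{enc}(p_\gamma(r)) \ne 0$ and so the inner expectation is 0. (We are
using Fact 1.7 here.) We deduce that

\[ 0 \le \mathbf{E}[\chi_\gamma(y)] \le \Pr[r \text{ is a root of } p_\gamma] \le \frac{n}{2^\ell} = 2^{-t}, \]

which is stronger than what we need.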
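The construction in this proof is short enough to run directly for small parameters. The sketch below assumes that field elements are stored as integers whose bits are polynomial coefficients (so `enc` is simply the bit vector), and it uses the irreducible polynomial $x^5 + x^2 + 1$ to realize the field with 32 elements; the function names and the parameters n = 8, ℓ = 5 (so $\epsilon = 1/4$) are illustrative choices.

```python
def gf_mul(a, b, ell, modulus):
    """Multiply a and b in GF(2^ell), reducing by the given irreducible polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> ell:          # degree reached ell, so subtract the modulus
            a ^= modulus
    return result


def parity(v):
    """Inner product of two bit vectors reduces to the parity of their AND."""
    return bin(v).count("1") & 1


def small_bias_multiset(n, ell, modulus):
    """All 2^(2*ell) strings y, one per pair (r, s), each stored as an n-bit integer."""
    points = []
    for r in range(1 << ell):
        powers, power = [], 1
        for _ in range(n):
            power = gf_mul(power, r, ell, modulus)
            powers.append(power)                 # r^1, ..., r^n
        for s in range(1 << ell):
            y = 0
            for i, p in enumerate(powers):
                y |= parity(p & s) << i          # y_i = <enc(r^i), enc(s)>
            points.append(y)
    return points


def biases(points, n):
    """E[chi_gamma(y)] for every gamma, via a Walsh-Hadamard transform of the counts."""
    counts = [0] * (1 << n)
    for y in points:
        counts[y] += 1
    h = 1
    while h < len(counts):
        for start in range(0, len(counts), 2 * h):
            for j in range(start, start + h):
                u, v = counts[j], counts[j + h]
                counts[j], counts[j + h] = u + v, u - v
        h *= 2
    return [c / len(points) for c in counts]


n, ell = 8, 5
points = small_bias_multiset(n, ell, 0b100101)   # modulus x^5 + x^2 + 1
values = biases(points, n)
print(len(points), max(values[1:]), n / 2 ** ell)
```

Transforming the histogram once gives the bias against every $\gamma$ simultaneously, which is much cheaper than looping over all nonzero $\gamma$ and all points. For these tiny parameters the multiset is larger than the whole cube; the construction only pays off when $n$ is large compared with $\log(n/\epsilon)$.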


The bound of $O(n/\epsilon)^2$ in this theorem is fairly close to being optimally
small; see Exercise 6.24 and the notes for this chapter.

Another useful tool in derandomization is that of k-wise independent distributions.
Sometimes a randomized algorithm using $n$ independent random bits
will still work assuming only that every subset of $k$ of the bits is independent.
Thus as with $\epsilon$-biased sets, it’s worthwhile to come up with deterministic
constructions of small sets $A \subseteq \mathbb{F}_2^n$ such that the density function $\varphi_A$ is k-wise
independent (i.e., $(0, k)$-regular). The best known examples have the additional
pleasant feature that $A$ is a linear subspace of $\mathbb{F}_2^n$; in this case, k-wise independence
is easy to characterize:


Proposition 6.31. Let $H$ be an $m \times n$ matrix over $\mathbb{F}_2$ and let $A \le \mathbb{F}_2^n$ be the
span of $H$’s rows. Then $\varphi_A$ is k-wise independent if and only if any sum of at
most $k$ columns of $H$ is nonzero in $\mathbb{F}_2^m$. (We exclude the “empty” sum.)

Proof. Since $\varphi_A = \sum_{\gamma \in A^\perp} \chi_\gamma$ (Proposition 3.11), $\varphi_A$ is k-wise independent
if and only if $|\gamma| > k$ for every $\gamma \in A^\perp \setminus \{0\}$. But $\gamma \in A^\perp$ if and only if
$H\gamma = 0$.


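Here is a simple construction of such a matrix with $m \sim k \log n$:


Theorem 6.32. Let $k, \ell \in \mathbb{N}^+$ and assume $n = 2^\ell \ge k$. Then for
$m = (k - 1)\ell + 1$, there is a matrix $H \in \mathbb{F}_2^{m \times n}$ such that any sum of at most $k$
columns of $H$ is nonzero in $\mathbb{F}_2^m$.


Proof. Write $\alpha_1, \ldots, \alpha_n$ for the elements of the finite field $\mathbb{F}_n$, and consider the
following matrix $H' \in \mathbb{F}_n^{k \times n}$:

\[
H' = \begin{bmatrix}
1 & 1 & \cdots & 1 \\
\alpha_1 & \alpha_2 & \cdots & \alpha_n \\
\alpha_1^2 & \alpha_2^2 & \cdots & \alpha_n^2 \\
\vdots & \vdots & \ddots & \vdots \\
\alpha_1^{k-1} & \alpha_2^{k-1} & \cdots & \alpha_n^{k-1}
\end{bmatrix}.
\]

Any submatrix of $H'$ formed by choosing $k$ columns is a Vandermonde matrix
and is therefore nonsingular. Hence any subset of $k$ columns of $H'$ is linearly
independent in $\mathbb{F}_n^k$. In particular, any sum of at most $k$ columns of $H'$ is
nonzero in $\mathbb{F}_n^k$. Now form $H \in \mathbb{F}_2^{m \times n}$ from $H'$ by replacing each entry $\alpha_j^i$
($i > 0$) with $\mathrm{enc}(\alpha_j^i)$, thought of as a column vector in $\mathbb{F}_2^\ell$. Since enc is a linear
map we may conclude that any sum of at most $k$ columns of $H$ is nonzero
in $\mathbb{F}_2^m$.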
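To see Theorem 6.32 and Proposition 6.31 working together, the following sketch builds the columns of $H$ for small parameters and checks both the column-sum condition and k-wise independence of the row span by enumeration. It assumes field elements are integers whose bits are polynomial coefficients, with the field of 8 elements realized via the irreducible polynomial $x^3 + x + 1$; all function names are hypothetical.

```python
from itertools import combinations


def gf_mul(a, b, ell, modulus):
    """Multiply a and b in GF(2^ell), reducing by the given irreducible polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> ell:
            a ^= modulus
    return result


def build_columns(k, ell, modulus):
    """Column j of H as a list of m = (k-1)*ell + 1 bits, one column per field element."""
    columns = []
    for alpha in range(1 << ell):
        bits = [1]                                   # top row of H' is all ones
        power = 1
        for _ in range(1, k):
            power = gf_mul(power, alpha, ell, modulus)              # alpha^i
            bits.extend((power >> b) & 1 for b in range(ell))       # enc(alpha^i)
        columns.append(bits)
    return columns


def small_sums_nonzero(columns, k):
    """True if every sum of between 1 and k distinct columns is nonzero over F_2."""
    m = len(columns[0])
    for size in range(1, k + 1):
        for chosen in combinations(columns, size):
            if not any(sum(col[row] for col in chosen) % 2 for row in range(m)):
                return False
    return True


def is_k_wise_independent(columns, k):
    """Check that z = y^T H is uniform on every k coordinates for uniform y in F_2^m."""
    m, n = len(columns[0]), len(columns)
    outputs = []
    for y in range(1 << m):
        ybits = [(y >> row) & 1 for row in range(m)]
        outputs.append([sum(ybits[row] * col[row] for row in range(m)) % 2
                        for col in columns])
    for coords in combinations(range(n), k):
        counts = {}
        for z in outputs:
            key = tuple(z[i] for i in coords)
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != 2 ** k or len(set(counts.values())) != 1:
            return False
    return True


columns = build_columns(3, 3, 0b1011)    # k = 3, n = 8, m = 7; modulus x^3 + x + 1
print(len(columns), len(columns[0]))
print(small_sums_nonzero(columns, 3), is_k_wise_independent(columns, 3))
```

Enumerating all $2^m$ coefficient vectors $y$ samples the row span uniformly even when the rows of $H$ are linearly dependent, so the independence check does not need a basis.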


Corollary 6.33. There is a deterministic algorithm that, given integers
$1 \le k \le n$, runs in $\mathrm{poly}(n^k)$ time and outputs a subspace $A \le \mathbb{F}_2^n$ of cardinality at
most $2^k n^{k-1}$ such that $\varphi_A$ is k-wise independent.


Proof. It suffices to assume $n = 2^\ell$ is a power of 2 and then obtain cardinality
$2 n^{k-1} = 2^{(k-1)\ell + 1}$. In this case, the algorithm constructs $H$ as in Theorem 6.32
and takes $A$ to be the span of its rows. The fact that $\varphi_A$ is k-wise independent
is immediate from Proposition 6.31.


For constant $k$ this upper bound of $O(n^{k-1})$ is close to optimal. It can be
improved to $O(n^{\lfloor k/2 \rfloor})$, but there is a lower bound of $\Omega(n^{\lfloor k/2 \rfloor})$ for constant $k$;
see Exercises 6.27, 6.28.

We conclude this section by noting that taking an €-biased density within a 
k-wise independent subspace yields an (€, k)-wise independent density: 
Lemma 6.34. Suppose $H \in \mathbb{F}_2^{m \times n}$ is such that any sum of at most $k$ columns
of $H$ is nonzero in $\mathbb{F}_2^m$. Let $\varphi$ be an $\epsilon$-biased density on $\mathbb{F}_2^m$. Consider drawing
$y \sim \varphi$ and setting $z = y^\top H \in \mathbb{F}_2^n$. Then the density of $z$ is $(\epsilon, k)$-wise
independent.


Proof. Suppose $\gamma \in \mathbb{F}_2^n$ has $0 < |\gamma| \le k$. Then $H\gamma$ is nonzero by assumption
and hence $|\mathbf{E}[\chi_\gamma(z)]| = |\mathbf{E}_{y \sim \varphi}[(-1)^{y^\top H \gamma}]| \le \epsilon$ since $\varphi$ is $\epsilon$-biased.


As a consequence, combining the constructions of Theorem 6.30 and Theorem 6.32
gives an $(\epsilon, k)$-wise independent distribution that can be sampled
from using only $O(\log k + \log\log(n) + \log(1/\epsilon))$ independent random
bits:


Theorem 6.35. There is a deterministic algorithm that, given integers
$1 \le k \le n$ and also $0 < \epsilon \le 1/2$, runs in time $\mathrm{poly}(n/\epsilon)$ and outputs a multiset
$A \subseteq \mathbb{F}_2^n$ of cardinality $O(k \log(n)/\epsilon)^2$ (a power of 2) such that $\varphi_A$ is $(\epsilon, k)$-wise
independent.


6.4. Applications in Learning and Testing 


In this section we describe some applications of our study of pseudorandom- 
ness. 

We begin with a notorious open problem from learning theory, that of learn- 
ing juntas. Let @={f:F5 —> F, | f isak-junta}; we will always assume 
that k < O(log n). In the query access model, it is quite easy to learn @ exactly 
(i.e., with error 0) in poly(7) time (Exercise 3.37(a)). However, in the model of 
random examples, it’s not obvious how to learn @ more efficiently than in the 
n* . poly(n) time required by the Low-Degree Algorithm (see Theorem 3.36). 
Unfortunately, this is superpolynomial as soon as k > w(1). The state of affairs 
is the same in the case of depth-k decision trees (a superclass of @), and is sim- 
ilar in the case of poly(n)-size DNFs and CNFs. Thus if we wish to learn, say, 
poly(n)-size decision trees or DNFs from random examples only, a necessary 
prerequisite is doing the same for O(log n)-juntas. 

Whether or not $\omega(1)$-juntas can be learned from random examples in
polynomial time is a longstanding open problem. Here we will show a modest
improvement on the $n^k$-time algorithm:


Theorem 6.36. For $k \le O(\log n)$, the class $\mathcal{C} = \{f : \mathbb{F}_2^n \to \mathbb{F}_2 \mid f \text{ is a } k\text{-junta}\}$
can be exactly learned from random examples in time $n^{(3/4)k} \cdot \mathrm{poly}(n)$.

(The 3/4 in this theorem can in fact be replaced by $\omega/(\omega + 1)$, where $\omega$ is any
number such that $n \times n$ matrices can be multiplied in time $O(n^{\omega})$.)

The first observation we will use to prove Theorem 6.36 is that to learn 
k-juntas, it suffices to be able to identify a single coordinate that is relevant (see 
Definition 2.18). The proof of this is fairly simple and is left for Exercise 6.31: 


Lemma 6.37. Theorem 6.36 follows from the existence of a learning algorithm
that, given random examples from a nonconstant $k$-junta $f : \mathbb{F}_2^n \to \mathbb{F}_2$, finds
at least one relevant coordinate for $f$ (with probability at least $1 - \delta$) in time
$n^{(3/4)k} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$.


Assume then that we have random example access to a (nonconstant) $k$-junta
$f : \mathbb{F}_2^n \to \mathbb{F}_2$. As in the Low-Degree Algorithm we will estimate the Fourier
coefficients $\widehat{f}(S)$ for all $1 \le |S| \le d$, where $d \le k$ is a parameter to be chosen
later. Using Proposition 3.30 we can ensure that all estimates are accurate
to within $(1/3)2^{-k}$, except with probability at most $\delta/2$, in time
$n^{d} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$. (Recall that $2^k \le \mathrm{poly}(n)$.) Since $f$ is a $k$-junta, all of its
Fourier coefficients are either 0 or at least $2^{-k}$ in magnitude; hence we can
exactly identify the sets $S$ for which $\widehat{f}(S) \ne 0$. For any such $S$, all of the
coordinates $i \in S$ are relevant for $f$ (Exercise 2.11). So unless $\widehat{f}(S) = 0$ for all
$1 \le |S| \le d$, we can find a relevant coordinate for $f$ in time
$n^{d} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ (except with probability at most $\delta/2$).

To complete the proof of Theorem 6.36 it remains to handle the case that
$\widehat{f}(S) = 0$ for all $1 \le |S| \le d$; i.e., $f$ is $d$th-order correlation immune. In this
case, by Siegenthaler's Theorem we know that $\deg_{\mathbb{F}_2}(f) \le k - d$. (Note that
$d < k$ since $f$ is not constant.) But there is a learning algorithm running in
time $O(n)^{3\ell} \cdot \log(1/\delta)$ that exactly learns any $\mathbb{F}_2$-polynomial of degree at
most $\ell$ (except with probability at most $\delta/2$). Roughly speaking, the algorithm
draws $O(n)^{\ell}$ random examples and then solves an $\mathbb{F}_2$-linear system to determine
the coefficients of the unknown polynomial; see Exercise 6.30 for details. Thus
in time $n^{3(k-d)} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ this algorithm will exactly determine $f$, and
in particular find a relevant coordinate.
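The linear-system step can be illustrated with a short sketch. This is not the algorithm of Exercise 6.30 verbatim: the function names, the bitmask encoding of equations, and the helper that draws seeded random examples are all assumptions made for illustration. Each example (x, p(x)) contributes one F2-linear equation in the unknown monomial coefficients, and Gauss-Jordan elimination over F2 recovers them once the examples determine the polynomial uniquely.

```python
import itertools
import random


def monomials(n, ell):
    """All monomials of degree at most ell, each a tuple of variable indices."""
    return [S for r in range(ell + 1) for S in itertools.combinations(range(n), r)]


def evaluate(poly, x):
    """Evaluate an F2-polynomial (a set of monomials) at the 0/1 point x."""
    return sum(all(x[i] for i in S) for S in poly) % 2


def random_examples(poly, n, m, seed=0):
    """Draw m uniform labeled examples (x, poly(x)); seeded for reproducibility."""
    rng = random.Random(seed)
    points = [[rng.randrange(2) for _ in range(n)] for _ in range(m)]
    return [(x, evaluate(poly, x)) for x in points]


def learn_f2_polynomial(examples, n, ell):
    """Return the unique polynomial of degree <= ell consistent with the examples,
    or None if the examples do not determine it uniquely."""
    monos = monomials(n, ell)
    m = len(monos)
    pivots = {}  # pivot column -> fully reduced equation (bit m holds the label)
    for x, label in examples:
        row = label << m
        for j, S in enumerate(monos):
            if all(x[i] for i in S):
                row |= 1 << j
        for col, prow in pivots.items():  # reduce the new equation
            if (row >> col) & 1:
                row ^= prow
        low = row & ((1 << m) - 1)
        if low == 0:
            if row >> m:
                raise ValueError("examples are inconsistent with every such polynomial")
            continue
        col = (low & -low).bit_length() - 1
        for c in pivots:  # keep earlier pivot rows reduced
            if (pivots[c] >> col) & 1:
                pivots[c] ^= row
        pivots[col] = row
    if len(pivots) < m:
        return None
    return {monos[c] for c, r in pivots.items() if (r >> m) & 1}
```

The sketch returns None when the system is underdetermined, which is the situation the sample-size bound in Exercise 6.30(a) is designed to rule out.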

By choosing $d = \lceil \tfrac34 k \rceil$ we balance the running time of the two algorithms.
Regardless of whether $f$ is $d$th-order correlation immune, at least one of the
two algorithms will find a relevant coordinate for $f$ (except with probability
at most $\delta/2 + \delta/2 = \delta$) in time $n^{(3/4)k} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$. This completes the
proof of Theorem 6.36.
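As a quick check of the balancing step, here is a worked calculation added for illustration; it ignores the $\mathrm{poly}(n) \cdot \log(1/\delta)$ factors common to both running times.

```latex
\[
  n^{d} = n^{3(k-d)} \iff d = 3(k-d) \iff d = \tfrac34 k .
\]
With $d = \lceil \tfrac34 k \rceil$ we have $d \le \tfrac34 k + 1$ and $3(k-d) \le \tfrac34 k$, so
\[
  \max\bigl\{ n^{d},\; n^{3(k-d)} \bigr\} \;\le\; n^{\frac34 k + 1} \;=\; n^{\frac34 k} \cdot \mathrm{poly}(n).
\]
```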

Our next application of pseudorandomness involves using $\epsilon$-biased
distributions to give a deterministic version of the Goldreich–Levin Algorithm
(and hence the Kushilevitz–Mansour learning algorithm) for functions $f$ with
small $\hat{\lVert} f \hat{\rVert}_1$. We begin with a basic lemma showing that you can get a good
estimate for the mean of such functions using an $\epsilon$-biased distribution:


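Lemma 6.38. If $f : \{-1,1\}^n \to \mathbb{R}$ and $\varphi : \{-1,1\}^n \to \mathbb{R}$ is an $\epsilon$-biased
density, then

\[
  \Bigl| \mathop{\mathbf{E}}_{x \sim \varphi}[f(x)] - \mathbf{E}[f] \Bigr| \le \hat{\lVert} f \hat{\rVert}_1 \, \epsilon .
\]

This lemma follows from Proposition 6.13.(1), but we provide a separate proof:

Proof. By Plancherel,

\[
  \mathop{\mathbf{E}}_{x \sim \varphi}[f(x)] = \langle \varphi, f \rangle
  = \widehat{f}(\emptyset) + \sum_{S \ne \emptyset} \widehat{\varphi}(S)\, \widehat{f}(S),
\]

and the difference of this from $\mathbf{E}[f] = \widehat{f}(\emptyset)$ is, in absolute value, at most

\[
  \sum_{S \ne \emptyset} |\widehat{\varphi}(S)| \cdot |\widehat{f}(S)|
  \le \epsilon \cdot \sum_{S \ne \emptyset} |\widehat{f}(S)|
  \le \hat{\lVert} f \hat{\rVert}_1 \, \epsilon .
\]

Since $\hat{\lVert} f^2 \hat{\rVert}_1 \le \hat{\lVert} f \hat{\rVert}_1^2$ (Exercise 3.6), we also have the following immediate
corollary:


Corollary 6.39. If $f : \{-1,1\}^n \to \mathbb{R}$ and $\varphi : \{-1,1\}^n \to \mathbb{R}$ is an $\epsilon$-biased
density, then

\[
  \Bigl| \mathop{\mathbf{E}}_{x \sim \varphi}[f(x)^2] - \mathbf{E}[f^2] \Bigr| \le \hat{\lVert} f \hat{\rVert}_1^2 \, \epsilon .
\]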


We can use the first lemma to get a deterministic version of Proposition 3.30,
the learning algorithm that estimates a specified Fourier coefficient.


Proposition 6.40. There is a deterministic algorithm that, given query access
to a function $f : \{-1,1\}^n \to \mathbb{R}$ as well as $U \subseteq [n]$, $0 < \epsilon \le 1/2$, and $s \ge 1$,
outputs an estimate $\widetilde{f}(U)$ for $\widehat{f}(U)$ satisfying

\[
  |\widetilde{f}(U) - \widehat{f}(U)| \le \epsilon ,
\]

provided $\hat{\lVert} f \hat{\rVert}_1 \le s$. The running time is $\mathrm{poly}(n, s, 1/\epsilon)$.


Proof. It suffices to handle the case $U = \emptyset$ because for general $U$, the
algorithm can simulate query access to $f \cdot \chi_U$ with $\mathrm{poly}(n)$ overhead, and
$\widehat{f \cdot \chi_U}(\emptyset) = \widehat{f}(U)$. The algorithm will use Theorem 6.30 to construct an
$(\epsilon/s)$-biased density $\varphi$ that is uniform over a (multi-)set of cardinality
$O(n^2 s^2/\epsilon^2)$. By enumerating over this set and using queries to $f$, it can
deterministically output the estimate $\widetilde{f}(\emptyset) = \mathbf{E}_{x \sim \varphi}[f(x)]$ in time
$\mathrm{poly}(n, s, 1/\epsilon)$. The error bound now follows from Lemma 6.38.
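The estimator in this proof is simple enough to sketch directly. The code below is an illustration under stated assumptions: a seeded random multiset stands in for the explicit construction of Theorem 6.30, its bias is computed by brute force rather than guaranteed in advance, and all function names are hypothetical. It averages f times the character of U over the multiset, so by Lemma 6.38 the error is at most the Fourier 1-norm of f times the bias of the multiset.

```python
import itertools
import random
from math import prod


def chi(S, x):
    """Character chi_S(x): product of x_i over i in S, for x in {-1,1}^n."""
    return prod(x[i] for i in S)


def subsets(n):
    """All subsets of {0, ..., n-1}, as tuples."""
    return [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]


def fourier_coefficient(f, n, S):
    """Exact coefficient: average of f(x) * chi_S(x) over the whole cube."""
    cube = list(itertools.product([-1, 1], repeat=n))
    return sum(f(x) * chi(S, x) for x in cube) / len(cube)


def random_multiset(n, size, seed=0):
    """Stand-in for Theorem 6.30 (an assumption of this sketch): a seeded random multiset."""
    rng = random.Random(seed)
    return [tuple(rng.choice([-1, 1]) for _ in range(n)) for _ in range(size)]


def bias(A, n):
    """Smallest eps for which the uniform density on the multiset A is eps-biased."""
    return max(abs(sum(chi(S, x) for x in A)) / len(A) for S in subsets(n) if S)


def estimate_coefficient(f, A, U):
    """Deterministic estimate of the coefficient on U: average f * chi_U over A."""
    return sum(f(x) * chi(U, x) for x in A) / len(A)
```

The running time is dominated by the size of the multiset, which is why the proof needs the small explicit set from Theorem 6.30 rather than the full cube.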
The other key ingredient needed for the Goldreich–Levin Algorithm was
Proposition 3.40, which let us estimate

\[
  \mathbf{W}^{S \mid \overline{J}}[f] = \sum_{T \subseteq \overline{J}} \widehat{f}(S \cup T)^2
  = \mathop{\mathbf{E}}_{z \sim \{-1,1\}^{\overline{J}}}\bigl[\widehat{f_{J \mid z}}(S)^2\bigr] \tag{6.6}
\]

for any $S \subseteq J \subseteq [n]$. Observe that for any $z \in \{-1,1\}^{\overline{J}}$ we can use
Proposition 6.40 to deterministically estimate $\widehat{f_{J \mid z}}(S)$ to accuracy $\pm\epsilon$. The
reason is that we can simulate query access to the restricted function $f_{J \mid z}$, the
$(\epsilon/s)$-biased density $\varphi$ remains $(\epsilon/s)$-biased on $\{-1,1\}^{J}$, and most importantly
$\hat{\lVert} f_{J \mid z} \hat{\rVert}_1 \le \hat{\lVert} f \hat{\rVert}_1 \le s$ by Exercise 3.7. It is not much more difficult to
deterministically estimate (6.6):


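Proposition 6.41. There is a deterministic algorithm that, given query access
to a function $f : \{-1,1\}^n \to \{-1,1\}$ as well as $S \subseteq J \subseteq [n]$, $0 < \epsilon \le 1/2$,
and $s \ge 1$, outputs an estimate $\beta$ for $\mathbf{W}^{S \mid \overline{J}}[f]$ that satisfies

\[
  |\mathbf{W}^{S \mid \overline{J}}[f] - \beta| \le \epsilon ,
\]

provided $\hat{\lVert} f \hat{\rVert}_1 \le s$. The running time is $\mathrm{poly}(n, s, 1/\epsilon)$.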


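Proof. Recall the notation $\mathrm{F}_{S \mid \overline{J}} f$ from Definition 3.20; by (6.6), the algorithm's
task is to estimate $\mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}[(\mathrm{F}_{S \mid \overline{J}} f)^2(z)]$. If $\varphi : \{-1,1\}^{\overline{J}} \to \mathbb{R}^{\ge 0}$ is an
$\frac{\epsilon}{4s^2}$-biased density, Corollary 6.39 tells us that

\[
  \Bigl| \mathop{\mathbf{E}}_{z \sim \varphi}[(\mathrm{F}_{S \mid \overline{J}} f)^2(z)]
       - \mathop{\mathbf{E}}_{z \sim \{-1,1\}^{\overline{J}}}[(\mathrm{F}_{S \mid \overline{J}} f)^2(z)] \Bigr|
  \le \hat{\lVert} \mathrm{F}_{S \mid \overline{J}} f \hat{\rVert}_1^2 \cdot \tfrac{\epsilon}{4s^2}
  \le \hat{\lVert} f \hat{\rVert}_1^2 \cdot \tfrac{\epsilon}{4s^2}
  \le \tfrac{\epsilon}{4}, \tag{6.7}
\]

where the second inequality is immediate from Proposition 3.21. We now show
the algorithm can approximately compute $\mathbf{E}_{z \sim \varphi}[(\mathrm{F}_{S \mid \overline{J}} f)^2(z)]$. For each
$z \in \{-1,1\}^{\overline{J}}$, the algorithm can use $\varphi$ to deterministically estimate
$(\mathrm{F}_{S \mid \overline{J}} f)(z) = \widehat{f_{J \mid z}}(S)$ to within $\pm s \cdot \frac{\epsilon}{4s^2} \le \frac{\epsilon}{4}$ in $\mathrm{poly}(n, s, 1/\epsilon)$ time, just
as was described in the text following (6.6). Since $|\widehat{f_{J \mid z}}(S)| \le 1$, the square of
this estimate is within, say, $\frac{3\epsilon}{4}$ of $(\mathrm{F}_{S \mid \overline{J}} f)^2(z)$. Hence by enumerating over the
support of $\varphi$, the algorithm can in deterministic $\mathrm{poly}(n, s, 1/\epsilon)$ time estimate
$\mathbf{E}_{z \sim \varphi}[(\mathrm{F}_{S \mid \overline{J}} f)^2(z)]$ to within $\pm\frac{3\epsilon}{4}$, which by (6.7) gives an estimate to within
$\pm\epsilon$ of the desired quantity $\mathbf{E}_{z \sim \{-1,1\}^{\overline{J}}}[(\mathrm{F}_{S \mid \overline{J}} f)^2(z)]$.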


Propositions 6.40 and 6.41 are the only two ingredients needed for a deran- 
domization of the Goldreich-Levin Algorithm. We can therefore state a deran- 
domized version of its corollary Theorem 3.38 on learning functions with small 
Fourier 1-norm: 
Theorem 6.42. Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \hat{\lVert} f \hat{\rVert}_1 \le s\}$. Then $\mathcal{C}$ is
deterministically learnable from queries with error $\epsilon$ in time $\mathrm{poly}(n, s, 1/\epsilon)$.


Since any $f : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{sparsity}(\widehat{f}) \le s$ also has $\hat{\lVert} f \hat{\rVert}_1 \le s$,
we may also deduce from Exercise 3.37(c):


Theorem 6.43. Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{sparsity}(\widehat{f}) \le 2^{O(k)}\}$.
Then $\mathcal{C}$ is deterministically learnable exactly (0 error) from queries in time
$\mathrm{poly}(n, 2^k)$.


Example functions that fall into the concept classes of these theorems are deci- 
sion trees of size at most s, and decision trees of depth at most k, respectively. 


We conclude this section by discussing a derandomized version of the Blum- 
Luby—Rubinfeld linearity test from Chapter 1.6: 


Derandomized BLR Test. Given query access to $f : \mathbb{F}_2^n \to \mathbb{F}_2$:


(1) Choose $x \sim \mathbb{F}_2^n$ and $y \sim \varphi$, where $\varphi$ is an $\epsilon$-biased density.
(2) Query $f$ at $x$, $y$, and $x + y$.
(3) "Accept" if $f(x) + f(y) = f(x + y)$.


Whereas the original BLR Test required exactly $2n$ independent random
bits, the above derandomized version needs only $n + O(\log(n/\epsilon))$. This is very
close to the minimum possible; a test using only, say, $.99n$ random bits would only
be able to inspect a $2^{-.01n}$ fraction of $f$'s values.

If $f$ is $\mathbb{F}_2$-linear then it is still accepted by the Derandomized BLR Test with
probability 1. As for the approximate converse, we'll have to make a slight
concession: We'll show that any function accepted with probability close to 1
must be close to an affine function, i.e., satisfy $\deg_{\mathbb{F}_2}(f) \le 1$. This concession is
necessary: the function $f : \mathbb{F}_2^n \to \mathbb{F}_2$ might be 1 everywhere except on the (tiny)
support of $\varphi$. In that case the acceptance criterion $f(x) + f(y) = f(x + y)$ will
almost always be $1 + 0 = 1$; yet $f$ is very far from every linear function. It is,
however, very close to the affine function 1.


Theorem 6.44. Suppose the Derandomized BLR Test accepts $f : \mathbb{F}_2^n \to \mathbb{F}_2$
with probability $\frac12 + \frac12\theta$. Then $f$ has correlation at least $\sqrt{\theta^2 - \epsilon}$ with some
affine $g : \mathbb{F}_2^n \to \mathbb{F}_2$; i.e., $\mathrm{dist}(f, g) \le \frac12 - \frac12\sqrt{\theta^2 - \epsilon}$.


Remark 6.45. The bound in this theorem works well both when $\theta$ is close to 0
and when $\theta$ is close to 1; e.g., for $\theta = 1 - 2\delta$ we get that if $f$ is accepted with
probability $1 - \delta$, then $f$ is nearly $\delta$-close to an affine function, provided $\epsilon \ll \delta$.
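Proof. As in the analysis of the BLR Test (Theorem 1.30) we encode $f$'s
outputs by $\pm 1 \in \mathbb{R}$. Using the first few lines of that analysis we see that our
hypothesis is equivalent to

\[
  \theta \le \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n,\; y \sim \varphi}[f(x) f(y) f(x + y)]
  = \mathop{\mathbf{E}}_{y \sim \varphi}[f(y) \cdot (f * f)(y)] .
\]

By Cauchy–Schwarz,

\[
  \mathop{\mathbf{E}}_{y \sim \varphi}[f(y) \cdot (f * f)(y)]
  \le \sqrt{\mathop{\mathbf{E}}_{y \sim \varphi}[f(y)^2]} \, \sqrt{\mathop{\mathbf{E}}_{y \sim \varphi}[(f * f)^2(y)]}
  = \sqrt{\mathop{\mathbf{E}}_{y \sim \varphi}[(f * f)^2(y)]} ,
\]

and hence

\[
  \theta^2 \le \mathop{\mathbf{E}}_{y \sim \varphi}[(f * f)^2(y)]
  \le \mathbf{E}[(f * f)^2] + \hat{\lVert} f * f \hat{\rVert}_1^2 \, \epsilon
  = \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \widehat{f}(\gamma)^4 + \epsilon ,
\]

where the inequality is Corollary 6.39 and we used $\widehat{f * f}(\gamma) = \widehat{f}(\gamma)^2$. The
conclusion of the proof is as in the original analysis (cf. Proposition 6.7,
Exercise 1.29):

\[
  \theta^2 - \epsilon \le \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \widehat{f}(\gamma)^4
  \le \max_{\gamma \in \widehat{\mathbb{F}_2^n}} \{\widehat{f}(\gamma)^2\} \cdot \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \widehat{f}(\gamma)^2
  = \max_{\gamma \in \widehat{\mathbb{F}_2^n}} \{\widehat{f}(\gamma)^2\} ,
\]

and hence there exists $\gamma^*$ such that $|\widehat{f}(\gamma^*)| \ge \sqrt{\theta^2 - \epsilon}$.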
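For small n the statement of Theorem 6.44 can be checked by brute force. The sketch below is illustrative only: functions are stored as dictionaries from bit tuples to 0/1, the density is uniform on an arbitrary multiset whose bias is computed by enumeration (a seeded random multiset stands in for Theorem 6.30), and every name is an assumption of the sketch. It computes the exact acceptance probability of the Derandomized BLR Test and the best correlation with an affine function.

```python
import itertools
import random


def dot(a, b):
    """Inner product over F_2 of two bit tuples."""
    return sum(p & q for p, q in zip(a, b)) % 2


def add(a, b):
    """Coordinatewise sum over F_2."""
    return tuple(p ^ q for p, q in zip(a, b))


def cube(n):
    return list(itertools.product([0, 1], repeat=n))


def random_multiset(n, size, seed=0):
    """Stand-in for an explicit small-bias set (an assumption of this sketch)."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(2) for _ in range(n)) for _ in range(size)]


def bias(A, n):
    """Largest |E_{y in A}[(-1)^<gamma,y>]| over nonzero gamma."""
    return max(abs(sum((-1) ** dot(g, y) for y in A)) / len(A) for g in cube(n) if any(g))


def acceptance_probability(f, A, n):
    """Exact acceptance probability: x uniform on F_2^n, y uniform on the multiset A."""
    pts = cube(n)
    hits = sum((f[x] ^ f[y]) == f[add(x, y)] for x in pts for y in A)
    return hits / (len(pts) * len(A))


def best_affine_correlation(f, n):
    """max over gamma of |E_x[(-1)^(f(x) + <gamma,x>)]|."""
    pts = cube(n)
    return max(abs(sum((-1) ** (f[x] ^ dot(g, x)) for x in pts)) / len(pts) for g in pts)
```

Writing the acceptance probability as 1/2 + θ/2 and the bias of A as ε, the theorem predicts that `best_affine_correlation` is at least the square root of θ² − ε whenever that quantity is nonnegative.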


6.5. Highlight: Fooling F2-Polynomials 


Recall that a density $\varphi$ is said to be $\epsilon$-biased if its correlation with every
$\mathbb{F}_2$-linear function $f$ is at most $\epsilon$ in magnitude. In the lingo of pseudorandomness,
one says that $\varphi$ fools the class of $\mathbb{F}_2$-linear functions:


Definition 6.46. Let $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ be a density function and let $\mathcal{C}$ be a class
of functions $\mathbb{F}_2^n \to \mathbb{R}$. We say that $\varphi$ $\epsilon$-fools $\mathcal{C}$ if

\[
  \Bigl| \mathop{\mathbf{E}}_{y \sim \varphi}[f(y)] - \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n}[f(x)] \Bigr| \le \epsilon
\]

for all $f \in \mathcal{C}$.


Theorem 6.30 implies that using just $O(\log(n/\epsilon))$ independent random bits,
one can generate a density that $\epsilon$-fools the class of $f : \mathbb{F}_2^n \to \{-1,1\}$ with
$\deg_{\mathbb{F}_2}(f) \le 1$. A natural problem in the field of derandomization is: How many
independent random bits are needed to generate a density which $\epsilon$-fools all
functions of $\mathbb{F}_2$-degree at most $d$? A naive hope might be that $\epsilon$-biased densities
automatically fool functions of $\mathbb{F}_2$-degree $d > 1$. The next example shows that
this hope fails badly, even for $d = 2$:


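Example 6.47. Recall the inner product mod 2 function, $\mathrm{IP}_n : \mathbb{F}_2^n \to \{0, 1\}$,
which has $\mathbb{F}_2$-degree 2. Let $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ be the density of the uniform
distribution on the support of $\mathrm{IP}_n$. Now $\mathrm{IP}_n$ is an extremely regular function (see
Example 6.4), and indeed $\varphi$ is a roughly $2^{-n/2}$-biased density (see Exercise 6.7).
But $\varphi$ is very bad at fooling at least one function of $\mathbb{F}_2$-degree 2, namely $\mathrm{IP}_n$
itself:

\[
  \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n}[\mathrm{IP}_n(x)] \approx 1/2, \qquad \mathop{\mathbf{E}}_{y \sim \varphi}[\mathrm{IP}_n(y)] = 1 .
\]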
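The example can be confirmed numerically for small even n. The following brute-force sketch is an illustration, with hypothetical function names and with the convention (an assumption here) that the inner product pairs the first half of the bits with the second half. It computes the bias of the uniform density on the support of the inner product function and the two expectations being compared.

```python
import itertools


def ip(x):
    """Inner product mod 2 of the first and second halves of the bit tuple x."""
    half = len(x) // 2
    return sum(x[i] & x[half + i] for i in range(half)) % 2


def dot(a, b):
    """Inner product over F_2 of two bit tuples."""
    return sum(p & q for p, q in zip(a, b)) % 2


def ip_support_statistics(n):
    """Return (bias of the uniform density on supp(IP_n),
    mean of IP_n under the uniform distribution,
    mean of IP_n under the uniform distribution on its support)."""
    cube = list(itertools.product([0, 1], repeat=n))
    support = [x for x in cube if ip(x)]
    bias = max(abs(sum((-1) ** dot(g, y) for y in support)) / len(support)
               for g in cube if any(g))
    uniform_mean = sum(ip(x) for x in cube) / len(cube)
    support_mean = sum(ip(y) for y in support) / len(support)
    return bias, uniform_mean, support_mean
```

For n = 6 the bias comes out to 1/7, which equals $2^{-3}/(1 - 2^{-3})$ as in Exercise 6.7, while the two expectations are 7/16 and 1.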


The problem of using few random bits to fool $n$-bit, $\mathbb{F}_2$-degree-$d$
functions was first taken up by Luby, Veličković, and Wigderson (Luby
et al., 1993). They showed how to generate a fooling distribution
using $\exp(O(\sqrt{d \log(n/d) + \log(1/\epsilon)}))$ independent random bits. There was
no improvement on this for 14 years, at which point Bogdanov and
Viola (Bogdanov and Viola, 2007) achieved $O(\log(n/\epsilon))$ random bits for
$d = 2$ and $O(\log n) + \exp(\mathrm{poly}(1/\epsilon))$ random bits for $d = 3$. In general, they
suggested that $\mathbb{F}_2$-degree-$d$ functions might be fooled by the sum of $d$
independent draws from a small-bias distribution. Soon thereafter Lovett (Lovett,
2008) showed that a sum of $2^d$ independent draws from a small-bias
distribution suffices, implying that $\mathbb{F}_2$-degree-$d$ functions can be fooled using just
$2^{O(d)} \cdot \log(n/\epsilon)$ random bits. More precisely, if $\varphi$ is any $\epsilon$-biased density on
$\mathbb{F}_2^n$, Lovett showed that

\[
  \Bigl| \mathop{\mathbf{E}}_{y^{(1)}, \ldots, y^{(2^d)} \sim \varphi}\bigl[f(y^{(1)} + \cdots + y^{(2^d)})\bigr]
       - \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n}[f(x)] \Bigr| \le O(\epsilon^{1/4^d}) .
\]

In other words, the $2^d$-fold convolution $\varphi^{*2^d}$ density fools functions of
$\mathbb{F}_2$-degree $d$.

The current state of the art for this problem is Viola's Theorem, which
shows that the original idea of Bogdanov and Viola (Bogdanov and Viola,
2007) works: Summing $d$ independent draws from an $\epsilon$-biased distribution
fools $\mathbb{F}_2$-degree-$d$ polynomials.


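Viola's Theorem. Let $\varphi$ be any $\epsilon$-biased density on $\mathbb{F}_2^n$, $0 \le \epsilon \le 1$. Let
$d \in \mathbb{N}^+$ and define $\epsilon_d = 9\epsilon^{1/2^{d-1}}$. Then the class of all $f : \mathbb{F}_2^n \to \{-1,1\}$ with
$\deg_{\mathbb{F}_2}(f) \le d$ is $\epsilon_d$-fooled by the $d$-fold convolution $\varphi^{*d}$; i.e.,

\[
  \Bigl| \mathop{\mathbf{E}}_{y^{(1)}, \ldots, y^{(d)} \sim \varphi}\bigl[f(y^{(1)} + \cdots + y^{(d)})\bigr]
       - \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n}[f(x)] \Bigr| \le 9\epsilon^{1/2^{d-1}} .
\]

In light of Theorem 6.30, Viola's Theorem implies that one can $\epsilon$-fool $n$-bit
functions of $\mathbb{F}_2$-degree $d$ using only $O(d \log n) + O(d 2^d \log(1/\epsilon))$ independent
random bits.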
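The bit count can be unpacked as follows. This is a worked calculation added for illustration; the symbol $\epsilon'$ for the target fooling error is introduced here, and the last step assumes $\epsilon' \le 1/2$ so that $\log(9/\epsilon') = O(\log(1/\epsilon'))$.

```latex
To $\epsilon'$-fool degree-$d$ functions, choose the bias $\epsilon$ so that $\epsilon_d = \epsilon'$:
\[
  9\,\epsilon^{1/2^{d-1}} = \epsilon'
  \iff \epsilon = (\epsilon'/9)^{2^{d-1}}
  \iff \log(1/\epsilon) = 2^{d-1} \log(9/\epsilon') .
\]
Each of the $d$ independent draws from the $\epsilon$-biased density of Theorem 6.30 costs
\[
  O(\log(n/\epsilon)) = O(\log n) + O\bigl(2^{d} \log(1/\epsilon')\bigr)
\]
random bits, for a total of $O(d \log n) + O\bigl(d\, 2^{d} \log(1/\epsilon')\bigr)$.
```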

The proof of Viola’s Theorem is an induction on d. To reduce the case of 
degree d + 1 to degree d, Viola makes use of a simple concept: directional 
derivatives. 


Definition 6.48. Let $f : \mathbb{F}_2^n \to \mathbb{F}_2$ and let $y \in \mathbb{F}_2^n$. The directional derivative
$\Delta_y f : \mathbb{F}_2^n \to \mathbb{F}_2$ is defined by

\[
  \Delta_y f(x) = f(x + y) - f(x) .
\]

Over $\mathbb{F}_2$ we may equivalently write $\Delta_y f(x) = f(x + y) + f(x)$.

As expected, taking a derivative reduces degree by 1:


Fact 6.49. For any $f : \mathbb{F}_2^n \to \mathbb{F}_2$ and $y \in \mathbb{F}_2^n$ we have
$\deg_{\mathbb{F}_2}(\Delta_y f) \le \deg_{\mathbb{F}_2}(f) - 1$.


In fact, we'll prove a slightly stronger statement:


Proposition 6.50. Let $f : \mathbb{F}_2^n \to \mathbb{F}_2$ have $\deg_{\mathbb{F}_2}(f) = d$ and fix $y, y' \in \mathbb{F}_2^n$.
Define $g : \mathbb{F}_2^n \to \mathbb{F}_2$ by $g(x) = f(x + y) - f(x + y')$. Then $\deg_{\mathbb{F}_2}(g) \le d - 1$.


Proof. In passing from the $\mathbb{F}_2$-polynomial representation of $f(x)$ to that of $g(x)$,
each monomial $x^S$ of maximal degree $d$ is replaced by $(x + y)^S - (x + y')^S$.
Upon expansion the monomials $x^S$ cancel, leaving a polynomial of degree at
most $d - 1$.
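Proposition 6.50 is easy to test exhaustively on small examples. The sketch below is an illustration with hypothetical names: functions are stored as truth tables indexed by integers whose bits are the coordinates, the monomial coefficients are obtained by the standard in-place Möbius transform over F2, and the zero function is assigned degree −1 purely as a convention for this sketch.

```python
import random


def f2_degree(table, n):
    """F2-degree of the function with truth table `table` (bit i of the index is x_i).
    The zero function gets degree -1 here (a convention chosen for this sketch)."""
    coeff = list(table)
    for i in range(n):  # Mobius transform: truth table -> monomial coefficients
        for x in range(1 << n):
            if (x >> i) & 1:
                coeff[x] ^= coeff[x ^ (1 << i)]
    return max((bin(S).count("1") for S in range(1 << n) if coeff[S]), default=-1)


def shifted_difference(table, n, y, y2):
    """Truth table of g(x) = f(x + y) - f(x + y2) over F_2 (XOR is addition)."""
    return [table[x ^ y] ^ table[x ^ y2] for x in range(1 << n)]


def random_table(n, seed=0):
    """A seeded uniformly random truth table on n bits."""
    rng = random.Random(seed)
    return [rng.randrange(2) for _ in range(1 << n)]
```

Taking `y2 = 0` recovers the directional derivative of Definition 6.48, so the same check also covers Fact 6.49.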


We are now ready to give the proof of Viola’s Theorem. 


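Proof of Viola's Theorem. The proof is by induction on $d$. The $d = 1$ case is
immediate (even without the factor of 9) because $\varphi$ is $\epsilon$-biased. Assume that the
theorem holds for general $d \ge 1$ and let $f : \mathbb{F}_2^n \to \{-1,1\}$ have
$\deg_{\mathbb{F}_2}(f) \le d + 1$. We split into two cases, depending on whether the bias of $f$
is large or small.

Case 1: $\mathbf{E}[f]^2 > \epsilon_d$. In this case,

\[
\begin{aligned}
  \sqrt{\epsilon_d} \cdot \Bigl| \mathop{\mathbf{E}}_{w \sim \varphi^{*(d+1)}}[f(w)] - \mathbf{E}[f] \Bigr|
  &< |\mathbf{E}[f]| \cdot \Bigl| \mathop{\mathbf{E}}_{w \sim \varphi^{*(d+1)}}[f(w)] - \mathbf{E}[f] \Bigr| \\
  &= \Bigl| \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n,\, z \sim \varphi^{*(d+1)}}[f(x) f(z)]
          - \mathop{\mathbf{E}}_{x, x' \sim \mathbb{F}_2^n}[f(x) f(x')] \Bigr| \\
  &= \Bigl| \mathop{\mathbf{E}}_{y \sim \mathbb{F}_2^n,\, z \sim \varphi^{*(d+1)}}[f(z + y) f(z)]
          - \mathop{\mathbf{E}}_{x, y \sim \mathbb{F}_2^n}[f(x + y) f(x)] \Bigr| \\
  &= \Bigl| \mathop{\mathbf{E}}_{y \sim \mathbb{F}_2^n,\, z \sim \varphi^{*(d+1)}}[\Delta_y f(z)]
          - \mathop{\mathbf{E}}_{x, y \sim \mathbb{F}_2^n}[\Delta_y f(x)] \Bigr| \\
  &\le \mathop{\mathbf{E}}_{y \sim \mathbb{F}_2^n}\Bigl[ \Bigl| \mathop{\mathbf{E}}_{z \sim \varphi^{*(d+1)}}[\Delta_y f(z)]
          - \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n}[\Delta_y f(x)] \Bigr| \Bigr] .
\end{aligned}
\]

For each outcome of $y$ the directional derivative $\Delta_y f$ has $\mathbb{F}_2$-degree at most $d$
(Fact 6.49). By induction we know that $\varphi^{*d}$ $\epsilon_d$-fools any such polynomial, and
it follows from Exercise 6.29 that $\varphi^{*(d+1)}$ does too. Thus each quantity in the
expectation over $y$ is at most $\epsilon_d$, and we conclude

\[
  \Bigl| \mathop{\mathbf{E}}_{w \sim \varphi^{*(d+1)}}[f(w)] - \mathbf{E}[f] \Bigr|
  \le \frac{\epsilon_d}{\sqrt{\epsilon_d}} = \sqrt{\epsilon_d} = \tfrac13 \epsilon_{d+1} \le \epsilon_{d+1} .
\]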


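Case 2: $\mathbf{E}[f]^2 \le \epsilon_d$. In this case we want to show that $\mathbf{E}_{w \sim \varphi^{*(d+1)}}[f(w)]^2$ is
nearly as small. By Cauchy–Schwarz,

\[
\begin{aligned}
  \mathop{\mathbf{E}}_{w \sim \varphi^{*(d+1)}}[f(w)]^2
  &= \mathop{\mathbf{E}}_{z \sim \varphi^{*d}}\Bigl[ \mathop{\mathbf{E}}_{y \sim \varphi}[f(z + y)] \Bigr]^2
   \le \mathop{\mathbf{E}}_{z \sim \varphi^{*d}}\Bigl[ \mathop{\mathbf{E}}_{y \sim \varphi}[f(z + y)]^2 \Bigr] \\
  &= \mathop{\mathbf{E}}_{z \sim \varphi^{*d}}\Bigl[ \mathop{\mathbf{E}}_{y, y' \sim \varphi}[f(z + y) f(z + y')] \Bigr]
   = \mathop{\mathbf{E}}_{y, y' \sim \varphi}\Bigl[ \mathop{\mathbf{E}}_{z \sim \varphi^{*d}}[f(z + y) f(z + y')] \Bigr] .
\end{aligned}
\]

For each outcome of $y$ and $y'$, the function $f(z + y) f(z + y')$ is of
$\mathbb{F}_2$-degree at most $d$ in the variables $z$, by Proposition 6.50. Hence by induction
we have

\[
\begin{aligned}
  \mathop{\mathbf{E}}_{y, y' \sim \varphi}\Bigl[ \mathop{\mathbf{E}}_{z \sim \varphi^{*d}}[f(z + y) f(z + y')] \Bigr]
  &\le \mathop{\mathbf{E}}_{y, y' \sim \varphi}\Bigl[ \mathop{\mathbf{E}}_{x \sim \mathbb{F}_2^n}[f(x + y) f(x + y')] \Bigr] + \epsilon_d \\
  &= \mathop{\mathbf{E}}_{y, y' \sim \varphi}[(f * f)(y + y')] + \epsilon_d \\
  &= \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \widehat{\varphi}(\gamma)^2 \widehat{f}(\gamma)^2 + \epsilon_d \\
  &\le \widehat{f}(0)^2 + \epsilon^2 \sum_{\gamma \ne 0} \widehat{f}(\gamma)^2 + \epsilon_d \\
  &\le 2\epsilon_d + \epsilon^2 ,
\end{aligned}
\]

where the last step used the hypothesis of Case 2. We have thus shown

\[
  \mathop{\mathbf{E}}_{w \sim \varphi^{*(d+1)}}[f(w)]^2 \le 2\epsilon_d + \epsilon^2 \le 3\epsilon_d \le 4\epsilon_d ,
\]

and hence $|\mathbf{E}[f(w)]| \le 2\sqrt{\epsilon_d}$. Since we are in Case 2, $|\mathbf{E}[f]| \le \sqrt{\epsilon_d}$, and so

\[
  \Bigl| \mathop{\mathbf{E}}_{w \sim \varphi^{*(d+1)}}[f(w)] - \mathbf{E}[f] \Bigr| \le 3\sqrt{\epsilon_d} = \epsilon_{d+1} ,
\]

as needed.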


We end this section by discussing the tightness of parameters in Viola's
Theorem. First, if we ignore the error parameter, then the result is sharp: Lovett
and Tzur (Lovett and Tzur, 2009) showed that the $d$-fold convolution of
$\epsilon$-biased densities cannot in general fool functions of $\mathbb{F}_2$-degree $d + 1$. More
precisely, for any $d \in \mathbb{N}^+$, $\ell \ge 2d + 1$ they give an explicit $\frac{\ell}{2^n}$-biased density
$\varphi$ on $\mathbb{F}_2^{(\ell+1)n}$ and an explicit function $f : \mathbb{F}_2^{(\ell+1)n} \to \{-1,1\}$ of degree $d + 1$ for
which

\[
  \Bigl| \mathop{\mathbf{E}}_{w \sim \varphi^{*d}}[f(w)] - \mathbf{E}[f] \Bigr| \ge 1 - \frac{2d}{2^n} .
\]

Regarding the error parameter in Viola's Theorem, it is not known whether the
quantity $\epsilon^{1/2^{d-1}}$ can be improved, even in the case $d = 2$. However, obtaining
even a modest improvement to $\epsilon^{1/1.99^d}$ (for $d$ as large as $\log n$) would constitute
a major advance since it would imply progress on the notorious problem of
"correlation bounds for polynomials"; see Viola (Viola, 2009).


6.6. Exercises and Notes 


6.1 Let $f$ be chosen as in Proposition 6.1. Compute $\mathbf{Var}[\widehat{f}(S)]$ for each
$S \subseteq [n]$.

6.2 Prove Fact 6.8.

6.3 Show that any nonconstant $k$-junta has $\mathbf{Inf}_i^{(1-\delta)}[f] \ge (1/2 - \delta/2)^{k-1}/k$
for at least one coordinate $i$.

6.4 Let $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ be an $\epsilon$-biased density. For each $d \in \mathbb{N}^+$ show that the
$d$-fold convolution $\varphi^{*d}$ is an $\epsilon^d$-biased density.

6.5 (a) Show that if $f : \{-1,1\}^n \to \mathbb{R}$ has $\epsilon$-small influences, then it is
$\sqrt{\epsilon}$-regular.

(b) Show that for all even $n$ there exists $f : \{-1,1\}^n \to \{-1,1\}$ that is
$2^{-n/2}$-regular but does not have $\epsilon$-small influences for any $\epsilon < 1/2$.

(c) Show that there is a function $f : \{-1,1\}^n \to \{-1,1\}$ with
$((1 - \delta)^{n-1}, \delta)$-small stable influences that is not $\epsilon$-regular for any
$\epsilon < 1$.

(d) Verify that the function $f(x) = x_0 \mathrm{Maj}_n(x_1, \ldots, x_n)$ from
Example 6.10 satisfies $\mathbf{Inf}_0^{(1-\delta)}[f] = \mathbf{Stab}_{1-\delta}[\mathrm{Maj}_n]$ for $\delta \in (0, 1)$, and thus
does not have $(\epsilon, \delta)$-small stable influences unless $\epsilon \ge 1 - \sqrt{\delta}$.

(e) Show that the function $f : \{-1,1\}^{n+1} \to \{-1,1\}$ from part (d) is
$\frac{1}{\sqrt{n}}$-regular.

(f) Suppose $f : \{-1,1\}^n \to \mathbb{R}$ has $(\epsilon, \delta)$-small stable influences. Show
that $f$ is $(\eta, k)$-regular for $\eta = \sqrt{\epsilon/(1 - \delta)^{k-1}}$.

(g) Show that $f$ has $(\epsilon, 1)$-small stable influences if and only if $f$ is
$(\sqrt{\epsilon}, 1)$-regular.

(h) Let $f : \{-1,1\}^n \to \{-1,1\}$ be monotone. Show that if $f$ is
$(\epsilon, 1)$-regular then $f$ is $\epsilon$-regular and has $\epsilon$-small influences.

6.6 (a) Let $f : \{-1,1\}^n \to \mathbb{R}$. Let $(J, \overline{J})$ be a partition of $[n]$ and let
$z \in \{-1,1\}^{\overline{J}}$. For $z \sim \{-1,1\}^{\overline{J}}$ uniformly random, give a formula
for $\mathbf{Var}_z[\mathbf{E}[f_{J \mid z}]]$ in terms of $f$'s Fourier coefficients. (Hint: Direct
application of Corollary 3.22.)

(b) Using the above formula and the probabilistic method, give an
alternate proof of the second statement of Proposition 6.12.

6.7 Let $\varphi : \mathbb{F}_2^n \to \mathbb{R}^{\ge 0}$ be the density corresponding to the uniform
distribution on the support of $\mathrm{IP}_n : \mathbb{F}_2^n \to \{0, 1\}$. Show that $\varphi$ is $\epsilon$-biased for
$\epsilon = 2^{-n/2}/(1 - 2^{-n/2})$, but not for smaller $\epsilon$.

6.8 Prove Proposition 6.13.

6.9 Compute the $\mathbb{F}_2$-polynomial representation of the equality function
$\mathrm{Equ}_n : \{0, 1\}^n \to \{0, 1\}$, defined by $\mathrm{Equ}_n(x) = 1$ if and only if $x_1 = x_2 = \cdots = x_n$.

6.10 (a) Let $f : \{0, 1\}^n \to \mathbb{R}$ and let $q(x) = \sum_{S \subseteq [n]} c_S x^S$ be the (unique)
multilinear polynomial representation of $f$ over $\mathbb{R}$. Show that
$c_S = \sum_{R \subseteq S} (-1)^{|S| - |R|} f(R)$, where we identify $R \subseteq [n]$ with its 0-1
indicator string. This formula is sometimes called Möbius inversion.

(b) Prove Proposition 6.21.

6.11 (Cf. Lemma 3.5.) Let $f : \mathbb{F}_2^n \to \mathbb{F}_2$ be nonzero and suppose $\deg_{\mathbb{F}_2}(f) \le k$.
Show that $\Pr[f(x) \ne 0] \ge 2^{-k}$. (Hint: As in the similar Exercise 3.4, use
induction on $n$.)
6.12 Let $f : \{-1,1\}^n \to \{0, 1\}$.

(a) Show that $\deg_{\mathbb{F}_2}(f) \le \log(\mathrm{sparsity}(\widehat{f}))$. (Hint: You will need
Exercise 3.7, Corollary 6.22, and Exercise 1.3.)

(b) Suppose $\widehat{f}$ is $2^{-k}$-granular. Show that $\deg_{\mathbb{F}_2}(f) \le k$. (This is a
stronger result than part (a), by Exercise 3.32.)

6.13 Let $f : \{-1,1\}^n \to \{-1,1\}$ be bent, $n > 2$. Show that $\deg_{\mathbb{F}_2}(f) \le n/2$.
(Note that the upper bound $n/2 + 1$ follows from Exercise 6.12(b).)

6.14 In this exercise you will prove Theorem 6.25.

(a) Suppose $p(x) = c_0 + c_S x^S + r(x)$ is a real multilinear polynomial
over $x_1, \ldots, x_n$ with $c_0, c_S \ne 0$, $|S| > \frac23 n$, and $|T| > \frac23 n$ for all
monomials $x^T$ appearing in $r(x)$. Show that after expansion and multilinear
reduction (meaning $x_i^2 \mapsto 1$), $p(x)^2$ contains the term $2 c_0 c_S x^S$.

(b) Deduce Theorem 6.25.


6.15 In this exercise you will explore the sharpness of Siegenthaler's Theorem
and Theorem 6.25.

(a) For all $n$ and $k < n - 1$, find an $f : \{0, 1\}^n \to \{0, 1\}$ that is $k$-resilient
and has $\deg_{\mathbb{F}_2}(f) = n - k - 1$.

(b) For all $n \ge 3$, find an $f : \{0, 1\}^n \to \{0, 1\}$ that is 1st-order correlation
immune and has $\deg_{\mathbb{F}_2}(f) = n - 1$.

(c) For all $n$ divisible by 3, find a biased $f : \{0, 1\}^n \to \{0, 1\}$ that is
$(\frac23 n - 1)$th-order correlation immune.

6.16 Prove Proposition 6.27.

6.17 Bent functions come in pairs: Show that if $f : \mathbb{F}_2^n \to \{-1,1\}$ is bent, then
$2^{n/2} \widehat{f}$ is also a bent function (with domain $\widehat{\mathbb{F}_2^n}$).

6.18 Extend Proposition 6.29 to show that if $\pi$ is any permutation on $\mathbb{F}_2^n$, then
$f(x, y) = \mathrm{IP}_{2n}(x, \pi(y)) g(y)$ is bent.

6.19 Dickson's Theorem says the following: Any polynomial $p : \mathbb{F}_2^n \to \mathbb{F}_2$ of
degree at most 2 can be expressed as

\[
  p(x) = \ell_0(x) + \sum_{j=1}^{k} \ell_j(x) \ell'_j(x), \tag{6.8}
\]

where $\ell_0$ is an affine function and $\ell_1, \ell'_1, \ldots, \ell_k, \ell'_k$ are linearly
independent linear functions. Here $k$ depends only on $p$ and is called the "rank"
of $p$. Show that for $n$ even, $g : \mathbb{F}_2^n \to \{-1,1\}$ defined by $g(x) = \chi(p(x))$
is bent if and only if $k = n/2$, if and only if $g$ arises from $\mathrm{IP}_n$ as in
Proposition 6.28.
6.20 Without appealing to Dickson's Theorem, prove that the complete
quadratic $x \mapsto \sum_{1 \le i < j \le n} x_i x_j$ can be expressed as in (6.8), with
$k = \lfloor n/2 \rfloor$. (Hint: Induction on $n$, with different steps depending on the parity
of $n$.)

6.21 Define $\mathrm{mod}_3 : \{-1,1\}^n \to \{0, 1\}$ by $\mathrm{mod}_3(x) = 1$ if and only if $\sum_{j=1}^n x_j$
is divisible by 3. Derive the Fourier expansion

\[
  \mathrm{mod}_3(x) = \tfrac13 + \tfrac23 (-1/2)^n \sum_{\substack{S \subseteq [n] \\ |S| \text{ even}}} (-1)^{(|S| \bmod 4)/2} \sqrt{3}^{\,|S|}\, x^S
\]

and conclude that the function $\mathrm{mod}_3$ is $\frac23 (\frac{\sqrt{3}}{2})^n$-regular. (Hint: Consider
$\prod_{j=1}^n (-\frac12 + \frac{\sqrt{-3}}{2} x_j)$.)

6.22 In Theorem 6.30, show that given $r$, $s$ any fixed bit $y_i$ can be obtained in
deterministic $\mathrm{poly}(\ell)$ time.

6.23 (a) Slightly modify the construction in Theorem 6.30 to obtain a
$(2^{-t} - 2^{-\ell})$-biased density. (Hint: Arrange for $p_\gamma$ to have degree
at most $n - 1$.)

(b) Since $\mathbb{F}_{2^\ell}$ is a dimension-$\ell$ vector space over $\mathbb{F}_2$, it has some basis
$v_1, \ldots, v_\ell$. Suppose we modify the construction in Theorem 6.30
so that $\varphi$ is a density on $\mathbb{F}_2^{n\ell}$, with $y_{ij} = \langle \mathrm{enc}(v_j r^i), \mathrm{enc}(s) \rangle$ for
$i \in [n]$, $j \in [\ell]$. Show that $\varphi$ remains $2^{-t}$-biased.

6.24 Fix $\epsilon \in (0, 1)$ and $n \in \mathbb{N}$. Let $A \subseteq \mathbb{F}_2^n$ be a randomly chosen multiset
in which $\lceil Cn/\epsilon^2 \rceil$ elements are included, independently and uniformly.
Show that if $C$ is a large enough constant, then $A$ is $\epsilon$-biased except with
probability at most $2^{-n}$.

6.25 Consider the problem of computing the matrix multiplication $C = AB$,
where $A, B \in \mathbb{F}_2^{n \times n}$. There is an algorithm (Stothers, 2010; Vassilevska
Williams, 2012) for solving this problem in time $O(n^{\omega})$, where $\omega < 2.373$;
however, the algorithm is very complicated. Suppose you are given
$A$, $B$, and the outcome $C'$ of running this algorithm; you want to test that
indeed $C' = AB$.

(a) Give an algorithm using $n$ random bits and time $O(n^2)$ with the
following property: If $C' = AB$, then the algorithm "accepts" with
probability 1; if $C' \ne AB$, then the algorithm "accepts" with
probability at most 1/2. (Hint: Compute $C'x$ and $ABx$ for a random
$x \in \mathbb{F}_2^n$.)

(b) Show how to reduce the number of random bits used to $O(\log n)$ at the
expense of making the false acceptance probability 2/3, while keeping

the running time $O(n^2)$. (You may use the fact that in Theorem 6.30,
the time required to compute $y$ given $r$ and $s$ is $n \cdot \mathrm{polylog}(\ell)$.)

6.26 Simplify the exposition and analysis of Theorem 6.32 and Corollary 6.33
in the case of $k = 2$, and show that you can take $m$ to be one less (i.e.,
$m = \ell$).

6.27 Consider the matrix $H' \in \mathbb{F}_2^{m \times n}$ constructed in Theorem 6.32, and suppose
we delete all rows corresponding to even (nonzero) powers of the $\alpha_j$'s.
Show that $H'$ retains the property that any sum of at most $k$ columns of $H'$
is nonzero in $\mathbb{F}_2^m$. (Hint: Prove and use that $(\sum_j \beta_j)^2 = \sum_j \beta_j^2$ for any
sequence of $\beta_j \in \mathbb{F}_{2^\ell}$.) Deduce that the cardinality of $A$ in Corollary 6.33
can be decreased to $2(2n)^{\lfloor k/2 \rfloor}$.

6.28 Let $A \subseteq \mathbb{F}_2^n$ be a multiset and suppose that the probability density $\varphi_A$
is $k$-wise independent. In this exercise you will prove the lower bound
$|A| \ge \Omega(n^{\lfloor k/2 \rfloor})$ (for $k$ constant).

(a) Suppose $\mathcal{F} \subseteq 2^{[n]}$ is a collection of subsets of $[n]$ such that $|S \cup T| \le k$
for all $S, T \in \mathcal{F}$. For each $S \in \mathcal{F}$ define $\chi^S \in \{-1,1\}^{|A|} \subseteq \mathbb{R}^{|A|}$ to
be the real vector with entries indexed by $A$ whose $a$th entry is
$a^S = \prod_{i \in S} a_i$. Show that the set of vectors $\{\frac{1}{\sqrt{|A|}} \chi^S : S \in \mathcal{F}\}$ is
orthonormal and hence $|A| \ge |\mathcal{F}|$.

(b) Show that we can find $\mathcal{F}$ satisfying $|\mathcal{F}| \ge \sum_{j=0}^{k/2} \binom{n}{j}$ if $k$ is even and
$|\mathcal{F}| \ge \sum_{j=0}^{(k-1)/2} \binom{n}{j} + \binom{n-1}{(k-1)/2}$ if $k$ is odd.

6.29 Let $\mathcal{C}$ be a class of functions $\mathbb{F}_2^n \to \mathbb{R}$ that is closed under translation;
i.e., $f^{+z} \in \mathcal{C}$ whenever $f \in \mathcal{C}$ and $z \in \mathbb{F}_2^n$ (recall Definition 3.24). An
example is the class of functions of $\mathbb{F}_2$-degree at most $d$. Show that if $\psi$
is a density that $\epsilon$-fools $\mathcal{C}$, then $\psi * \varphi$ also $\epsilon$-fools $\mathcal{C}$ for any density $\varphi$.

6.30 Fix an integer $\ell \ge 1$. In this exercise you will generalize Exercise 3.43 by
showing how to exactly learn $\mathbb{F}_2$-polynomials of degree at most $\ell$.

(a) Fix $p : \mathbb{F}_2^n \to \mathbb{F}_2$ with $\deg_{\mathbb{F}_2}(p) \le \ell$ and suppose that
$x^{(1)}, \ldots, x^{(m)} \sim \mathbb{F}_2^n$ are drawn uniformly and independently from $\mathbb{F}_2^n$. Assume
that $m \ge C \cdot 2^{\ell}(n^{\ell} + \log(1/\delta))$ for $0 < \delta \le 1/2$ and $C$ a sufficiently
large constant. Show that except with probability at most $\delta$, the only
$q : \mathbb{F}_2^n \to \mathbb{F}_2$ with $\deg_{\mathbb{F}_2}(q) \le \ell$ that satisfies $q(x^{(i)}) = p(x^{(i)})$ for all
$i \in [m]$ is $q = p$. (Hint: Exercise 6.11 with $q - p$.)

(b) Show that the concept class of all polynomials $\mathbb{F}_2^n \to \mathbb{F}_2$ of degree
at most $\ell$ can be learned from random examples only, with error 0,
in time $O(n)^{3\ell}$. (Remark: As in Exercise 3.43, since the key step is
solving a linear system, the learning algorithm can also be done in
$O(n)^{\omega\ell}$ time, assuming matrix multiplication can be done in $O(n^{\omega})$
time.)

(c) Extend this learning algorithm so that in running time
$O(n)^{3\ell} \cdot \log(1/\delta)$ it achieves success probability at least $1 - \delta$. (Hint:
Similar to Exercise 3.40.)

6.31 In this exercise you will prove Lemma 6.37.

(a) Give a $\mathrm{poly}(n, 2^k) \cdot \log(1/\delta)$-time learning algorithm that, given
random examples from a $k$-junta $f : \mathbb{F}_2^n \to \mathbb{F}_2$, determines (except with
probability at most $\delta$) if $f$ is a constant function, and if so, which one.

(b) Given access to random examples from a $k$-junta $f : \mathbb{F}_2^n \to \mathbb{F}_2$, let
$P \subseteq [n]$ be a set of relevant coordinates for $f$ and let $z \in \mathbb{F}_2^P$. Show
how to obtain $M$ independent random examples from the
$(k - |P|)$-junta $f_{\overline{P} \mid z}$, in time $\mathrm{poly}(n, 2^k) \cdot M \cdot \log(1/\delta)$ (except with
probability at most $\delta$).

(c) Complete the proof of Lemma 6.37. (Hint: Build a depth-$k$ decision
tree for $f$.)

6.32 (a) Improve the bound in Lemma 6.38 to $\hat{\lVert} f \hat{\rVert}_1 \epsilon - |\widehat{f}(\emptyset)| \epsilon$ and the bound
in Corollary 6.39 to $\hat{\lVert} f \hat{\rVert}_1^2 \epsilon - \lVert f \rVert_2^2 \epsilon$.

(b) Improve the bound in Theorem 6.44 to $\sqrt{\theta^2 - \epsilon}/\sqrt{1 - \epsilon}$.

6.33 Improve on Theorem 6.44 by a factor of roughly 2 in the case of acceptance
probability near 1. Specifically, show that if $f$ passes the Derandomized
BLR Test with probability $1 - \delta$, then there exists $\gamma^* \in \widehat{\mathbb{F}_2^n}$ with
$|\widehat{f}(\gamma^*)| \ge \sqrt{1 - 2\delta - \epsilon}/\sqrt{1 - \epsilon}$.

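6.34 Fix an integer $k \in \mathbb{N}^+$. Let $(f_s)_{s \in \{0,1\}^k}$ be a collection of functions indexed
by length-$k$ binary sequences, each $f_s : \mathbb{F}_2^n \to \mathbb{R}$. Define the $k$th Gowers
"inner product" $\langle (f_s)_s \rangle_{U^k} \in \mathbb{R}$ by

\[
  \langle (f_s)_s \rangle_{U^k} = \mathop{\mathbf{E}}_{x, y_1, \ldots, y_k}\Bigl[ \prod_{s \in \{0,1\}^k} f_s\Bigl(x + \sum_{i : s_i = 1} y_i\Bigr) \Bigr],
\]

where the $k + 1$ random vectors $x, y_1, \ldots, y_k$ are independent and
uniformly distributed on $\mathbb{F}_2^n$. Define the $k$th Gowers norm of a function
$f : \mathbb{F}_2^n \to \mathbb{R}$ by

\[
  \lVert f \rVert_{U^k} = \langle (f, f, \ldots, f) \rangle_{U^k}^{1/2^k},
\]

where $(f, f, \ldots, f)$ denotes that all $2^k$ functions in the collection equal $f$.
(You will later verify that $\langle (f, f, \ldots, f) \rangle_{U^k}$ is always nonnegative.)

(a) Check that $\langle f_0, f_1 \rangle_{U^1} = \mathbf{E}[f_0] \mathbf{E}[f_1]$ and therefore $\lVert f \rVert_{U^1}^2 = \mathbf{E}[f]^2$.

(b) Check that

\[
  \langle f_{00}, f_{10}, f_{01}, f_{11} \rangle_{U^2} = \sum_{\gamma \in \widehat{\mathbb{F}_2^n}} \widehat{f_{00}}(\gamma) \widehat{f_{10}}(\gamma) \widehat{f_{01}}(\gamma) \widehat{f_{11}}(\gamma)
\]

and therefore $\lVert f \rVert_{U^2}^4 = \hat{\lVert} f \hat{\rVert}_4^4$. (Cf. Exercise 1.29(b).)

(c) Show that

\[
  \langle (f_s)_s \rangle_{U^k} = \mathop{\mathbf{E}}_{y_1, \ldots, y_{k-1}}\Bigl[
    \mathop{\mathbf{E}}_{x}\Bigl[ \prod_{s : s_k = 0} f_s\Bigl(x + \sum_{i < k : s_i = 1} y_i\Bigr) \Bigr]
    \cdot \mathop{\mathbf{E}}_{x'}\Bigl[ \prod_{s : s_k = 1} f_s\Bigl(x' + \sum_{i < k : s_i = 1} y_i\Bigr) \Bigr] \Bigr], \tag{6.9}
\]

where $x'$ is independent of $x, y_1, \ldots, y_{k-1}$ and uniformly distributed.

(d) Show that $\langle (f, f, \ldots, f) \rangle_{U^k}$ is always nonnegative, as promised.

(e) Using (6.9) and Cauchy–Schwarz, show that

\[
  \langle (f_s)_s \rangle_{U^k} \le \sqrt{\langle (f_{(s_1, \ldots, s_{k-1}, 0)})_s \rangle_{U^k}} \, \sqrt{\langle (f_{(s_1, \ldots, s_{k-1}, 1)})_s \rangle_{U^k}} .
\]

(f) Show that

\[
  \langle (f_s)_s \rangle_{U^k} \le \prod_{s \in \{0,1\}^k} \lVert f_s \rVert_{U^k} . \tag{6.10}
\]

(g) Fixing $f : \mathbb{F}_2^n \to \mathbb{R}$, show that $\lVert f \rVert_{U^k} \le \lVert f \rVert_{U^{k+1}}$. (Hint: Consider
$(f_s)_{s \in \{0,1\}^{k+1}}$ defined by $f_s = f$ if $s_{k+1} = 0$ and $f_s = 1$ if $s_{k+1} = 1$.)

(h) Show that $\lVert \cdot \rVert_{U^k}$ satisfies the triangle inequality and is therefore a
seminorm. (Hint: First show that

\[
  \lVert f_0 + f_1 \rVert_{U^k}^{2^k} = \sum_{S \subseteq \{0,1\}^k} \bigl\langle (f_{1[s \in S]})_{s \in \{0,1\}^k} \bigr\rangle_{U^k}
\]

and then use (6.10).)

(i) Show that $\lVert \cdot \rVert_{U^k}$ is in fact a norm for all $k \ge 2$; i.e.,
$\lVert f \rVert_{U^k} = 0 \Rightarrow f = 0$.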

Notes 


The $\mathbb{F}_2$-polynomial representation of a Boolean function $f$ is often called its
algebraic normal form. It seems to have first been explicitly introduced by Zhegalkin in
1927 (Zhegalkin, 1927).

For functions f : Z, —> R, the idea of €-regularity as a pseudorandomness notion 
dates back to Chung and Graham (Chung and Graham, 1992), as does the equiva- 
lent combinatorial condition Proposition 6.7. (In the context of quasirandom graphs, 
the ideas date further back to Thomason (Thomason, 1987) and to Chung, Graham, 
and Wilson (Chung et al., 1989).) The idea of treating functions with small (sta- 
ble) influences as being “generic” has its origins in the work of Kahn, Kalai, and 
Linial (Kahn et al., 1988). The notion was brought to the fore in work on hardness of 
approximation: implicitly, by Håstad (Håstad, 1996, 1999), and later more explicitly
by Khot, Kindler, Mossel, and O'Donnell (Khot et al., 2007).

The notion of $\epsilon$-biased sets (and also $(\epsilon, k)$-wise independent distributions) was
introduced by Naor and Naor (Naor and Naor, 1993) (see also the independent work of
Peralta (Peralta, 1990)). The construction in Theorem 6.30 is due to Alon, Goldreich,
Håstad, and Peralta (Alon et al., 1992) (as is Exercise 6.23). As noted by Naor and
Naor (Naor and Naor, 1993), $\epsilon$-biased sets are closely related to error-correcting codes
over $\mathbb{F}_2$; indeed, they are equivalent to linear error-correcting codes in which all pairs of
codewords have relative distance in $[\frac12 - \frac12\epsilon, \frac12 + \frac12\epsilon]$. In particular, the construction in
Theorem 6.30 is the concatenation of the well-known Reed–Solomon and Hadamard codes
(see, e.g., MacWilliams and Sloane (MacWilliams and Sloane, 1977) for definitions).
The nonconstructive upper bound in Exercise 6.24 is essentially the Gilbert–Varshamov
bound and is close to the known lower bound of $\Omega(\frac{n}{\epsilon^2 \log(1/\epsilon)})$ (assuming $\epsilon \ge 2^{-\Omega(n)}$), which
follows from the work of McEliece, Rodemich, Rumsey, and Welch (McEliece et al.,
1977) (see (MacWilliams and Sloane, 1977)). Additionally, constructive upper bounds of
$O(\frac{n}{\epsilon^3})$ and $O(\frac{n^{5/4}}{\epsilon^{5/2}})$ are known using tools from coding theory; see the work of Ben-Aroya
and Ta-Shma (Ben-Aroya and Ta-Shma, 2009) and Matthews and Peachey (Matthews
and Peachey, 2011).

The probabilistic notion of correlation immunity, i.e., condition (2) of Corollary
6.14, was first introduced by Siegenthaler (Siegenthaler, 1984); we further discuss his
work below. Independently and shortly thereafter, Chor, Friedman, Goldreich, Håstad,
Rudich, and Smolensky (Chor et al., 1985) introduced the definition of resilience and also
connected it to $(0, k)$-regularity of the Fourier spectrum; i.e., they proved Corollary 6.14.
(In the cryptography literature, Corollary 6.14 is called the Xiao–Massey Theorem (Xiao
and Massey, 1988).) The work (Chor et al., 1985) also essentially contains Theorem 6.25
and the relevant function from Example 6.16; cf. the work of Mossel et al. (Mossel et al.,
2004).

The problem of constructing explicit k-wise distributions of small support arose in 
different guises in different areas — in the study of orthogonal arrays (in statistics), 
error-correcting codes, and algorithmic derandomization. Alon, Babai, and Itai (Alon 
et al., 1985) gave the construction in Theorem 6.32 — in fact, the stronger one from 
Exercise 6.27 — based on the analysis of dual BCH codes in MacWilliams and 
Sloane (MacWilliams and Sloane, 1977). The lower bound from Exercise 6.28 is essen- 
tially due to Rao (Rao, 1947); see also independent proofs (Chor et al., 1985; Alon et al., 
1985). 

Siegenthaler’s Theorem dates from 1984 (Siegenthaler, 1984). His motivation was 
the study of stream ciphers in cryptography. In this application, a short
random sequence of bits (“secret key”) is transformed via some scheme into a very
long sequence of pseudorandom bits (“keystream”), which can then be used as a one-
time pad for encryption. A basic component of most schemes is a linear feedback
shift register (LFSR), which can efficiently generate long, fairly statistically-uniform
sequences. However, due to its $\mathbb{F}_2$-linearity, it suffers from some simple cryptanalytic
attacks. An early idea for combating this is to take $n$ independent LFSR streams and
combine them via some function $f : \mathbb{F}_2^n \to \mathbb{F}_2$. Effective attacks are possible in such a
scheme if $f$ is correlated with any of its input bits, or indeed (as Siegenthaler pointed
out) with any input pair, triple, etc. This led Siegenthaler to define the probabilistic notion
of correlation-immunity. Although $\chi_{[n]}$ is the maximally correlation-immune function,
it is not suitable as an LFSR combining function precisely because of its $\mathbb{F}_2$-linearity;
the same is true of any function of low $\mathbb{F}_2$-degree. Siegenthaler precisely captured this
tradeoff between correlation-immunity and $\mathbb{F}_2$-degree in his theorem.

Bent functions were named and first studied by Rothaus around 1966; he didn’t
publish the notion until 1976, however (Rothaus, 1976), at which point there were already
several works on the subject; see, e.g., (Dillon, 1972). Bent functions have applications in
cryptography and coding theory; see, e.g., Carlet’s survey (Carlet, 2010). The basic
constructions presented in Section 6.3 are due to Rothaus; the class of bent functions
described in Exercise 6.18 is called the Maiorana–McFarland family. Dickson’s
Theorem is from a 1901 publication (Dickson, 1901, Theorem 199); see also MacWilliams
and Sloane (MacWilliams and Sloane, 1977, Theorem 15.4).

Theorem 6.36 is from Mossel et al. (Mossel et al., 2004); there is an improved
algorithm for learning k-juntas that runs in time roughly n^{.60k} poly(n), due to Gregory
Valiant (Valiant, 2012). Avrim Blum offers a prize of $1,000 for solving the case of
k = log log n in poly(n) time (Blum, 2003). Theorem 6.42 is due to Kushilevitz and
Mansour (Kushilevitz and Mansour, 1993). The Derandomized BLR Test and Theorem 6.44
(and Exercise 6.32) are due to Ben-Sasson, Sudan, Vadhan, and Wigderson (Ben-Sasson 
et al., 2003). 

The result of Exercise 6.11 is due to Muller (Muller, 1954a, Theorem 6); deriving 
Exercise 6.30 from it and from Blumer et al. (Blumer et al., 1987) is folklore. The result of 
Exercise 6.12(a) is due to Bernasconi and Codenotti (Bernasconi and Codenotti, 1999); 
Exercise 6.13 is from MacWilliams and Sloane (MacWilliams and Sloane, 1977). In 
Exercise 6.25, part (a) is due to Freivalds (Freivalds, 1979) and part (b) to Naor and 
Naor (Naor and Naor, 1993). The Gowers norm and results of Exercise 6.34 are from 
Gowers (Gowers, 2001). Our proof of the second statement in Proposition 6.12 was 
suggested by Noam Lifshitz. 


7 
Property Testing, PCPPs, and CSPs 


In this chapter we study several closely intertwined topics: property testing, 
probabilistically checkable proofs of proximity (PCPPs), and constraint satis- 
faction problems (CSPs). All of our work will be centered around the task of 
testing whether an unknown Boolean function is a dictator. We begin by extend- 
ing the BLR Test to give a 3-query property testing algorithm for the class of 
dictator functions. This in turn allows us to give a 3-query testing algorithm for 
any property, so long as the right “proof” is provided. We then introduce CSPs, 
which are in fact identical to string testing algorithms. Finally, we explain 
how dictator tests can be translated into computational complexity results for 
CSPs, and we sketch the proofs of some of Håstad’s optimal inapproximability
results. 


7.1. Dictator Testing 


In Chapter 1.6 we described the BLR property testing algorithm: Given query
access to an unknown function $f : \{0,1\}^n \to \{0,1\}$, this algorithm queries $f$ on
a few random inputs and approximately determines whether $f$ has the property
of being linear over $\mathbb{F}_2$. The field of property testing for Boolean functions
is concerned with coming up with similar algorithms for other properties. In
general, a “property” can be any collection $\mathcal{C}$ of $n$-bit Boolean functions; it’s
the same as the notion of “concept class” from learning theory. Indeed, before
running an algorithm to try to learn an unknown $f \in \mathcal{C}$, one might first run a
property testing algorithm to try to verify that indeed $f \in \mathcal{C}$.

Let’s encapsulate the key aspects of the BLR linearity test with some defi- 
nitions: 
Definition 7.1. An $r$-query function testing algorithm for Boolean functions
$f : \{0,1\}^n \to \{0,1\}$ is a randomized algorithm that:


• chooses $r$ (or fewer) strings $x^{(1)}, \dots, x^{(r)} \in \{0,1\}^n$ according to some
probability distribution;

• queries $f(x^{(1)}), \dots, f(x^{(r)})$;

• based on the outcomes, decides (deterministically) whether to “accept” $f$.


Definition 7.2. Let $\mathcal{C}$ be a “property” of $n$-bit Boolean functions, i.e., a
collection of functions $\{0,1\}^n \to \{0,1\}$. We say a function testing algorithm is a
local tester for $\mathcal{C}$ (with rejection rate $\lambda > 0$) if it satisfies the following:


• If $f \in \mathcal{C}$, then the tester accepts with probability 1.

• For all $0 \leq \epsilon \leq 1$, if $\mathrm{dist}(f, \mathcal{C}) > \epsilon$ (in the sense of Definition 1.29), then
the tester rejects $f$ with probability greater than $\lambda \cdot \epsilon$.
Equivalently, if the tester accepts $f$ with probability at least $1 - \lambda \cdot \epsilon$, then
$f$ is $\epsilon$-close to $\mathcal{C}$, i.e., $\exists g \in \mathcal{C}$ such that $\mathrm{dist}(f, g) \leq \epsilon$.


By taking $\epsilon = 0$ in the above definition you see that any local tester gives
a characterization of $\mathcal{C}$: a function is in $\mathcal{C}$ if and only if it is accepted by
the tester with probability 1. But a local tester furthermore gives a “robust”
characterization: Any function accepted with probability close to 1 must be
close to satisfying $\mathcal{C}$.


Example 7.3. By Theorem 1.30, the BLR Test is a 3-query local tester for the
property $\mathcal{C} = \{f : \mathbb{F}_2^n \to \mathbb{F}_2 \mid f \text{ is linear}\}$ (with rejection rate 1).


Remark 7.4. To be pedantic, the BLR linearity test is actually a family of local
testers, one for each value of $n$. This is a common scenario: We will usually be
interested in testing natural families of properties $(\mathcal{C}_n)_{n \in \mathbb{N}^+}$, where $\mathcal{C}_n$ contains
functions $\{0,1\}^n \to \{0,1\}$. In this case we need to describe a family of testers,
one for each $n$. Generally, these testers will “act the same” for all values of $n$
and will have the property that the rejection rate $\lambda > 0$ is a universal constant
independent of $n$.


There are a number of standard variations of Definition 7.2 that one could
consider. One variation is to allow for an adaptive testing algorithm, meaning
that the algorithm can decide how to generate $x^{(t)}$ based on the query outcomes
$f(x^{(1)}), \dots, f(x^{(t-1)})$. However, in this book we will only consider nonadaptive
testing. Another variation is to relax the requirement that $\epsilon$-far functions be
rejected with probability $\Omega(\epsilon)$; one could allow for smaller rates such as $\Omega(\epsilon^2)$,
or $\Omega(\epsilon/\log n)$. For simplicity, we will stick with the strict demand that the
rejection probability be linear in $\epsilon$. Finally, the most common definition of
property testing allows the number of queries to be a function $r(\epsilon)$ of $\epsilon$ but
requires that any function $\epsilon$-far from $\mathcal{C}$ be rejected with probability at least 1/2.
This is easier to achieve than satisfying Definition 7.2; see Exercise 7.1.

So far we have seen that the property of being linear over $\mathbb{F}_2$ is locally
testable. We’ll now spend some time discussing local testability of an even
simpler property, the property of being a dictator. In other words, we’ll consider
the property

$$\mathcal{D} = \{f : \{0,1\}^n \to \{0,1\} \mid f(x) = x_i \text{ for some } i \in [n]\}.$$

As we will see, dictatorship is in some ways the most important property to be
able to test.

We begin with a reminder: Even though $\mathcal{D}$ is a subclass of the linear functions
and we have a local tester for linearity, this doesn’t mean we automatically have
a local tester for dictatorship. (This is in contrast to learning theory, where a
learning algorithm for a concept class automatically works for any subclass.)
The reason is that the non-dictator linear functions (i.e., $\chi_S$ for $|S| \neq 1$) are
at distance $\frac12$ from $\mathcal{D}$ but are accepted by any linearity test with probability 1.

Still, we could use a linearity test as a first component of a test for
dictatorship; this essentially reduces the problem to testing if an unknown linear
function is a dictator. Historically, the first local testers for dictatorship (Bellare
et al., 1995; Parnas et al., 2001) worked this way; after testing linearity, they
chose $x, y \sim \{0,1\}^n$ uniformly and independently, set $z = x \wedge y$ (the bitwise
logical AND), and tested whether $f(z) = f(x) \wedge f(y)$. The idea is that the
only parity functions that satisfy this “AND test” with probability 1 are the
dictators (and the constant 0). The analysis of the test takes a bit of work; see
Exercise 7.8 for details.

Here we will describe a simpler dictatorship test. Recall we have already 
seen an important result that characterizes dictatorship: Arrow’s Theorem, 
from Chapter 2.5. Furthermore the robust version of Arrow’s Theorem (Corol- 
lary 2.60) involves evaluating a 3-candidate Condorcet election under the impar- 
tial culture assumption, and this is the same as querying the election rule f 
on 3 correlated random inputs. This suggests a dictatorship testing component 
we call the “NAE Test”: 


NAE Test. Given query access to $f : \{-1,1\}^n \to \{-1,1\}$:


• Choose $x, y, z \in \{-1,1\}^n$ by letting each triple $(x_i, y_i, z_i)$ be drawn
independently and uniformly at random from among the 6 triples satisfying the
not-all-equal predicate $\mathrm{NAE}_3 : \{-1,1\}^3 \to \{0,1\}$.

• Query $f$ at $x$, $y$, $z$.

• Accept if $\mathrm{NAE}_3(f(x), f(y), f(z))$ is satisfied.
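
The three steps of the NAE Test can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the text: the names `NAE_TRIPLES` and `nae_test` are our own, functions are represented as Python callables on tuples of ±1 values, and coordinates are 0-indexed.

```python
import itertools
import random

# The 6 triples in {-1,1}^3 whose coordinates are not all equal.
NAE_TRIPLES = [t for t in itertools.product((-1, 1), repeat=3) if len(set(t)) > 1]

def nae_test(f, n, rng=random):
    """Run one round of the NAE Test on f : {-1,1}^n -> {-1,1}.

    Returns True if the round accepts. The helper name and the
    callable-based representation of f are assumptions of this sketch.
    """
    # Draw (x_i, y_i, z_i) independently and uniformly from the 6 NAE triples.
    triples = [rng.choice(NAE_TRIPLES) for _ in range(n)]
    x, y, z = (tuple(t[k] for t in triples) for k in range(3))
    # Accept iff the three answers are not all equal.
    return len({f(x), f(y), f(z)}) > 1
```

A dictator such as `lambda x: x[2]` is accepted in every round, since its three answers are themselves one of the sampled NAE triples; a constant function is rejected in every round.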
The NAE Test by itself is almost a 3-query local tester for the property of
being a dictator. Certainly if $f$ is a dictator then the NAE Test accepts with
probability 1. Furthermore, in Chapter 2.5 we proved:


Theorem 7.5 (Restatement of Corollary 2.60). If the NAE Test accepts $f$ with
probability $1 - \epsilon$, then $\mathbf{W}^1[f] \geq 1 - \frac92\epsilon$, and hence $f$ is $O(\epsilon)$-close to $\pm\chi_i$
for some $i \in [n]$ by the FKN Theorem.


There are two slightly unsatisfactory aspects to this theorem. First, it gives
a local tester only for the property of being a dictator or a negated-dictator.
Second, though the deduction $\mathbf{W}^1[f] \geq 1 - \frac92\epsilon$ requires only simple Fourier
analysis, the conclusion that $f$ is close to a (negated-)dictator relies on the
non-trivial FKN Theorem. Fortunately we can fix both issues simply by adding
in the BLR Test:


Theorem 7.6. Given query access to $f : \{-1,1\}^n \to \{-1,1\}$, perform both
the BLR Test and the NAE Test. This is a 6-query local tester for the property
of being a dictator (with rejection rate .1).


Proof. The first condition in Definition 7.2 is easy to check: If $f : \{-1,1\}^n \to
\{-1,1\}$ is a dictator, then both tests accept $f$ with probability 1. To check the
second condition, fix $0 \leq \epsilon \leq 1$ and assume the overall test accepts $f$ with
probability at least $1 - .1\epsilon$. Our goal is to show that $f$ is $\epsilon$-close to some
dictator.

Since the overall test accepts with probability at least $1 - .1\epsilon$, both the
BLR and the NAE tests must individually accept $f$ with probability at least
$1 - .1\epsilon$. By the analysis of the NAE Test we deduce that $\mathbf{W}^1[f] \geq 1 - \frac92 \cdot .1\epsilon =
1 - .45\epsilon$. By the analysis of the BLR Test (Theorem 1.30) we deduce that $f$
is $.1\epsilon$-close to some parity function; i.e., $\hat{f}(S^*) \geq 1 - .2\epsilon$ for some $S^* \subseteq [n]$.
Now if $|S^*| \neq 1$ we would have

$$1 = \sum_{k=0}^{n} \mathbf{W}^k[f] \geq (1 - .45\epsilon) + (1 - .2\epsilon)^2 \geq 2 - .85\epsilon > 1,$$

a contradiction. Thus we must have $|S^*| = 1$ and hence $f$ is $.1\epsilon$-close to the
dictator $\chi_{S^*}$, stronger than what we need.


As you can see, we haven’t been particularly careful about obtaining the 
largest possible rejection rate. Instead, we will be more interested in using as 
few queries as possible (while maintaining some positive constant rejection 
rate). Indeed we now show a small trick which lets us reduce our 6-query 
local tester for dictatorship down to a 3-query one. This is best possible since
dictatorship can’t be locally tested with 2 queries (see Exercise 7.6). 


BLR+NAE Test. Given query access to $f : \{-1,1\}^n \to \{-1,1\}$:


• With probability 1/2, perform the BLR Test on $f$.
• With probability 1/2, perform the NAE Test on $f$.


Theorem 7.7. The BLR+NAE Test is a 3-query local tester for the property of 
being a dictator (with rejection rate .05). 


Proof. The only observation we need to make is that if the BLR+NAE Test
accepts with probability $1 - .05\epsilon$ then both the BLR and the NAE tests
individually must accept $f$ with probability at least $1 - .1\epsilon$. The result then follows
from the analysis of Theorem 7.6.
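
As a rough illustration of the BLR+NAE Test, the following Python sketch runs one randomly chosen 3-query subtest per round and estimates the acceptance probability empirically. The helper names (`blr_round`, `nae_round`, `blr_nae_round`, `acceptance_rate`) are our own, and the BLR round is written in ±1 notation, where adding inputs over $\mathbb{F}_2$ becomes entrywise multiplication; treat it as a sketch under these assumptions rather than a definitive implementation.

```python
import itertools
import random

NAE_TRIPLES = [t for t in itertools.product((-1, 1), repeat=3) if len(set(t)) > 1]

def blr_round(f, n, rng):
    """BLR Test in +-1 notation: accept iff f(x) f(y) = f(x * y) entrywise."""
    x = [rng.choice((-1, 1)) for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    z = [a * b for a, b in zip(x, y)]
    return f(x) * f(y) == f(z)

def nae_round(f, n, rng):
    """NAE Test: accept iff f(x), f(y), f(z) are not all equal."""
    triples = [rng.choice(NAE_TRIPLES) for _ in range(n)]
    x, y, z = ([t[k] for t in triples] for k in range(3))
    return len({f(x), f(y), f(z)}) > 1

def blr_nae_round(f, n, rng):
    """Perform exactly one of the two 3-query subtests, each with probability 1/2."""
    subtest = blr_round if rng.random() < 0.5 else nae_round
    return subtest(f, n, rng)

def acceptance_rate(f, n, rounds, seed=0):
    """Empirical acceptance probability over independent rounds."""
    rng = random.Random(seed)
    return sum(blr_nae_round(f, n, rng) for _ in range(rounds)) / rounds
```

For instance, `acceptance_rate(lambda x: x[0], 4, 2000)` equals 1.0 for a dictator, while the two-bit parity `lambda x: x[0] * x[1]` passes every BLR round but fails a constant fraction of NAE rounds, so its empirical rate is noticeably below 1.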


Remark 7.8. In general, this trick lets us take the maximum of the query
complexities when we combine tests, rather than the sum (at the expense of
worsening the rejection rate). Suppose we wish to combine $t = O(1)$ different testing
algorithms, where the $i$th tester uses $r_i$ queries. We make an overall test that
performs each subtest with probability $1/t$. This gives a $\max(r_1, \dots, r_t)$-query
testing algorithm with the following guarantee: If the overall test accepts $f$
with probability $1 - \frac{\lambda}{t}\epsilon$ then every subtest must accept $f$ with probability at
least $1 - \lambda\epsilon$.
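
The guarantee in Remark 7.8 comes from a one-line calculation, spelled out here for convenience. The notation $\mathrm{rej}_i(f)$ for the probability that the $i$th subtest rejects $f$ is ours, introduced only for this sketch. For any fixed subtest $j$,

\[
\Pr[\text{overall test rejects } f] \;=\; \frac{1}{t} \sum_{i=1}^{t} \mathrm{rej}_i(f) \;\geq\; \frac{1}{t}\, \mathrm{rej}_j(f),
\]

so if the overall test rejects with probability at most $\frac{\lambda}{t}\epsilon$, then $\mathrm{rej}_j(f) \leq \lambda\epsilon$ for every $j$. Taking $t = 2$ and $\lambda = .1$ gives exactly the step used in the proof of Theorem 7.7.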


We can now explain one reason why dictatorship is a particularly important
property to be able to test locally. Given the BLR Test for linear functions it
still took us a little thought to find a local test for the subclass $\mathcal{D}$ of dictators.
But given our dictatorship test, it’s easy to give a 3-query local tester for any
subclass of $\mathcal{D}$. (On a related note, Exercise 7.15 asks you to give a 3-query local
tester for any affine subspace of the linear functions.)


Theorem 7.9. Let $\mathcal{S}$ be any subclass of $n$-bit dictators; i.e., let $S \subseteq [n]$ and let
$$\mathcal{S} = \{\chi_i : \{0,1\}^n \to \{0,1\} \mid i \in S\}.$$
Then there is a 3-query local tester for $\mathcal{S}$ (with rejection rate .01).


Proof. Let $1_S \in \{0,1\}^n$ denote the indicator string for the subset $S$. Given
access to $f : \{0,1\}^n \to \{0,1\}$, the test is as follows:


• With probability 1/2, perform the BLR+NAE Test on $f$.
• With probability 1/2, apply the local correcting routine of Proposition 1.31
to $f$ on string $1_S$; accept if and only if the output value is 1.
This test always makes either 2 or 3 queries, and whenever $f \in \mathcal{S}$ it accepts
with probability 1. Now let $0 \leq \epsilon \leq 1$ and suppose the test accepts $f$ with
probability at least $1 - \lambda\epsilon$, where $\lambda = .01$. Our goal will be to show that $f$ is
$\epsilon$-close to a dictator $\chi_i$ with $i \in S$.

Since the overall test accepts $f$ with probability at least $1 - \lambda\epsilon$, the
BLR+NAE Test must accept $f$ with probability at least $1 - 2\lambda\epsilon$. By
Theorem 7.7 we may deduce that $f$ is $40\lambda\epsilon$-close to some dictator $\chi_i$. Our goal is to
show that $i \in S$; this will complete the proof because $40\lambda\epsilon \leq \epsilon$ (by our choice
of $\lambda = .01$).

So suppose by way of contradiction that $i \notin S$; i.e., $\chi_i(1_S) = 0$. Since $f$ is
$40\lambda\epsilon$-close to the parity function $\chi_i$, Proposition 1.31 tells us that

$$\Pr[\text{locally correcting } f \text{ on input } 1_S \text{ produces the output } \chi_i(1_S) = 0] \geq 1 - 80\lambda\epsilon.$$

On the other hand, since the overall test accepts $f$ with probability at least
$1 - \lambda\epsilon$, the second subtest must accept $f$ with probability at least $1 - 2\lambda\epsilon$.
This means

$$\Pr[\text{locally correcting } f \text{ on input } 1_S \text{ produces the output } 0] \leq 2\lambda\epsilon.$$

But this is a contradiction, since $2\lambda\epsilon < 1 - 80\lambda\epsilon$ for all $0 \leq \epsilon \leq 1$ (by our
choice of $\lambda = .01$). Hence $i \in S$ as desired.
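
The second subtest in the proof of Theorem 7.9 is short enough to sketch in Python. The code assumes the standard two-query form of local correction for linear functions over $\mathbb{F}_2$ (output $f(y) + f(x + y)$ for a uniformly random $y$), which is our reading of Proposition 1.31; the function names and the 0-indexed coordinates are likewise assumptions of this sketch.

```python
import random

def locally_correct(f, x, rng):
    """Two-query local correction over F_2: return f(y) + f(x + y) for uniform y.

    Assumed form of the routine from Proposition 1.31; f maps lists of
    0/1 values to 0 or 1.
    """
    y = [rng.randrange(2) for _ in x]
    x_plus_y = [(a + b) % 2 for a, b in zip(x, y)]
    return (f(y) + f(x_plus_y)) % 2

def membership_subtest(f, n, S, rng):
    """Accept iff locally correcting f on the indicator string 1_S outputs 1."""
    indicator = [1 if i in S else 0 for i in range(n)]
    return locally_correct(f, indicator, rng) == 1
```

If $f$ is exactly the dictator $\chi_i$, the routine outputs $y_i + (1_S)_i + y_i = (1_S)_i$ for every $y$, so the subtest accepts with probability 1 when $i \in S$ and rejects with probability 1 when $i \notin S$. The full tester of Theorem 7.9 would run this subtest or the BLR+NAE Test, each with probability 1/2.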


7.2. Probabilistically Checkable Proofs of Proximity 


In the previous section we saw that every subproperty of the dictatorship 
property has a 3-query local tester. In this section we will show that any 
property whatsoever has a 3-query local tester — if an appropriate “proof” is 
provided. 

To make sense of this statement let’s first generalize the setting in which 
we study property testing. Definitions 7.1 and 7.2 are concerned with testing 
a Boolean function $f : \{0,1\}^n \to \{0,1\}$ by querying its values on various
inputs. If we think of $f$’s truth table as a Boolean string of length $N = 2^n$, then
a testing algorithm simply queries various coordinates of this string. It makes
sense to generalize to the notion of testing properties of $N$-bit strings, for any
length $N$. Here a property $\mathcal{C}$ will just be a collection $\mathcal{C} \subseteq \{0,1\}^N$ of strings,
and we’ll be concerned with the relative Hamming distance $\mathrm{dist}(w, w') =
\frac{1}{N}\Delta(w, w')$ between strings. For simplicity, we’ll begin to write $n$ instead
of $N$.
Definition 7.10. An $r$-query string testing algorithm for strings $w \in \{0,1\}^n$ is
a randomized algorithm that:


• chooses $r$ (or fewer) indices $i_1, \dots, i_r \in [n]$ according to some probability
distribution;

• queries $w_{i_1}, \dots, w_{i_r}$;

• based on the outcomes, decides (deterministically) whether to “accept” $w$.


We may also generalize this definition to testing strings $w \in \Omega^n$ over finite
alphabets $\Omega$ of cardinality larger than 2.


Definition 7.11. Let $\mathcal{C} \subseteq \{0,1\}^n$ be a “property” of $n$-bit Boolean strings. We
say a string testing algorithm is a local tester for $\mathcal{C}$ (with rejection rate $\lambda > 0$)
if it satisfies the following:


• If $w \in \mathcal{C}$, then the tester accepts with probability 1.

• For all $0 \leq \epsilon \leq 1$, if $\mathrm{dist}(w, \mathcal{C}) > \epsilon$, then the tester rejects $w$ with probability
greater than $\lambda \cdot \epsilon$.
Equivalently, if the tester accepts $w$ with probability at least $1 - \lambda \cdot \epsilon$, then $w$
is $\epsilon$-close to $\mathcal{C}$; i.e., $\exists w' \in \mathcal{C}$ such that $\mathrm{dist}(w, w') \leq \epsilon$.


Example 7.12. Let $\mathcal{Z} = \{(0, 0, \dots, 0)\} \subseteq \{0,1\}^n$ be the property of being
the all-zeroes string. Then the following is a 1-query local tester for $\mathcal{Z}$ (with
rejection rate 1): Pick a uniformly random index $i$ and accept if $w_i = 0$.

Let $\mathcal{E} = \{(0, 0, \dots, 0), (1, 1, \dots, 1)\} \subseteq \{0,1\}^n$ be the property of having all
coordinates equal. Then the following is a 2-query local tester for $\mathcal{E}$: Pick two
independent and uniformly random indices $i$ and $j$ and accept if $w_i = w_j$.
In Exercise 7.4 you are asked to show that if $\mathrm{dist}(w, \mathcal{E}) = \epsilon$, then this tester
rejects $w$ with probability $\frac12 - \frac12(1 - 2\epsilon)^2 \geq \epsilon$.

Let $\mathcal{O} = \{w \in \mathbb{F}_2^n : w \text{ has an odd number of 1's}\}$. This property does not
have a local tester making few queries. In fact, in Exercise 7.5 you are
asked to show that any local tester for $\mathcal{O}$ must make the maximum number of
queries, $n$.
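
The rejection probability quoted for the all-coordinates-equal tester in Example 7.12 can be checked exactly on small strings. The Python sketch below does so with exact rational arithmetic; the function names are our own, and it is meant only as a sanity check on one instance, not as a solution to Exercise 7.4.

```python
from fractions import Fraction

def rejection_probability(w):
    """Exact probability that w_i != w_j for independent uniform indices i, j."""
    n = len(w)
    disagreeing = sum(1 for a in w for b in w if a != b)
    return Fraction(disagreeing, n * n)

def distance_to_all_equal(w):
    """Relative Hamming distance from w to the nearer of 00...0 and 11...1."""
    ones = sum(w)
    return Fraction(min(ones, len(w) - ones), len(w))

def closed_form(eps):
    """The expression 1/2 - (1/2)(1 - 2 eps)^2 from Example 7.12."""
    return Fraction(1, 2) - Fraction(1, 2) * (1 - 2 * eps) ** 2
```

For `w = [1, 0, 0, 0, 0, 0, 0, 0]` the distance is 1/8 and both `rejection_probability(w)` and `closed_form(Fraction(1, 8))` evaluate to 7/32, which is indeed at least 1/8.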


As the last example shows, not every property has a local tester making a 
small number of queries; indeed, most properties of n-bit strings do not. This 
is rather too bad: Imagine that for any large $n$ and any complicated property
$\mathcal{C} \subseteq \{0,1\}^n$ there were an $O(1)$-query local tester. Then if anyone supplied
you with a string $w$ claiming it satisfied $\mathcal{C}$, you wouldn’t have to laboriously
check this yourself, nor would you have to trust the supplier; you could simply
spot-check $w$ in a constant number of coordinates and become convinced that $w$
is (close to being) in $\mathcal{C}$.
But what if, in addition to $w \in \{0,1\}^n$, you could require the supplier to give
you some additional side information $\Pi \in \{0,1\}^\ell$ about $w$ so as to assist you in
testing that $w \in \mathcal{C}$? One can think of $\Pi$ as a kind of “proof” that $w$ satisfies $\mathcal{C}$.
In this case it’s possible that you can spot-check $w$ and $\Pi$ together in a constant
number of coordinates and become convinced that $w$ is (close to being) in
$\mathcal{C}$, all without having to “trust” the supplier of the string $w$ and the purported
proof $\Pi$. These ideas lead to the notion of probabilistically checkable proofs of
proximity (PCPPs).


Definition 7.13. Let $\mathcal{C} \subseteq \{0,1\}^n$ be a property of $n$-bit Boolean strings and let
$\ell \in \mathbb{N}$. We say that $\mathcal{C}$ has an $r$-query, length-$\ell$ probabilistically checkable proof
of proximity (PCPP) system (with rejection rate $\lambda > 0$) when the following
holds: There exists an $r$-query testing algorithm $T$ for $(n + \ell)$-bit strings,
thought of as pairs $w \in \{0,1\}^n$ and $\Pi \in \{0,1\}^\ell$, such that:


• (“Completeness.”) If $w \in \mathcal{C}$, then there exists a “proof” $\Pi \in \{0,1\}^\ell$ such
that $T$ accepts with probability 1.

• (“Soundness.”) For all $0 \leq \epsilon \leq 1$, if $\mathrm{dist}(w, \mathcal{C}) > \epsilon$, then for every “proof”
$\Pi \in \{0,1\}^\ell$ the tester $T$ rejects with probability greater than $\lambda \cdot \epsilon$.
Equivalently, if there exists $\Pi \in \{0,1\}^\ell$ that causes $T$ to accept with
probability at least $1 - \lambda \cdot \epsilon$, then $w$ must be $\epsilon$-close to $\mathcal{C}$.


PCPP systems are also known as assisted testers, locally testable proofs, or 
assignment testers. 


Remark 7.14. A word on the three parameters: We are usually interested in
fixing the number of queries $r$ to a very small universal constant (such as 3)
while trying to keep the proof length $\ell = \ell(n)$ relatively small (e.g., $\mathrm{poly}(n)$ is
a good goal). We are usually not very concerned with the rejection rate $\lambda$ so
long as it’s a positive universal constant (independent of $n$).


Example 7.15. In Example 7.12 we stated that $\mathcal{O} = \{w \in \mathbb{F}_2^n : w_1 + \cdots +
w_n = 1\}$ has no local tester making fewer than $n$ queries. But it’s easy to give a
3-query PCPP system for $\mathcal{O}$ with proof length $n - 1$ (and rejection rate 1). The
idea is to require the proof string $\Pi$ to contain the partial sums of $w$:

$$\Pi_j = \sum_{i=1}^{j+1} w_i \pmod 2.$$
The tester will perform one of the following checks, uniformly at random:

$$\begin{aligned}
\Pi_1 &= w_1 + w_2 \\
\Pi_2 &= \Pi_1 + w_3 \\
\Pi_3 &= \Pi_2 + w_4 \\
&\;\;\vdots \\
\Pi_{n-1} &= \Pi_{n-2} + w_n \\
\Pi_{n-1} &= 1
\end{aligned}$$

Evidently the tester always makes at most 3 queries. Further, in the
“completeness” case $w \in \mathcal{O}$, if $\Pi$ is a correct list of partial sums then the tester will accept
with probability 1. It remains to analyze the “soundness” case, $w \notin \mathcal{O}$. Here we
are significantly aided by the fact that $\mathrm{dist}(w, \mathcal{O})$ must be exactly $1/n$ (since
every string is at Hamming distance either 0 or 1 from $\mathcal{O}$). Thus to confirm the
claimed rejection rate of 1, we only need to observe that if $w \notin \mathcal{O}$ then at least
one of the tester’s $n$ checks must fail.
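
The partial-sum PCPP system of Example 7.15 is easy to simulate. In the Python sketch below, strings are 0-indexed lists of bits (so `proof[j - 1]` plays the role of $\Pi_j$), and the helper names are our own. The function `checks` evaluates all $n$ checks at once purely for illustration; an actual tester would pick one of them uniformly at random and read at most 3 bits.

```python
import itertools

def honest_proof(w):
    """Partial sums of w mod 2: proof[j - 1] = w_1 + ... + w_{j+1} for j = 1..n-1."""
    proof, running = [], w[0]
    for bit in w[1:]:
        running ^= bit
        proof.append(running)
    return proof

def checks(w, proof):
    """Outcomes of the tester's n checks (requires n >= 2)."""
    n = len(w)
    results = [proof[0] == (w[0] ^ w[1])]
    results += [proof[j] == (proof[j - 1] ^ w[j + 1]) for j in range(1, n - 1)]
    results.append(proof[n - 2] == 1)
    return results

def every_proof_fails_some_check(w):
    """Brute-force soundness check over all 2^(n-1) candidate proofs."""
    candidates = itertools.product((0, 1), repeat=len(w) - 1)
    return all(not all(checks(w, list(p))) for p in candidates)
```

For the odd-weight string `[1, 0, 1, 1]`, the honest proof `[1, 0, 1]` passes all four checks. For the even-weight string `[1, 1, 0, 0]`, `every_proof_fails_some_check` returns `True`, matching the observation that at least one of the $n$ checks must fail whenever $w \notin \mathcal{O}$.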


This example generalizes to give a very efficient PCPP system for testing 
that $w$ satisfies any fixed $\mathbb{F}_2$-linear equation. What about testing that $w$ satisfies
a fixed system of $\mathbb{F}_2$-linear equations? This interesting question is explored in
Exercise 7.16, which serves as a good warmup for our next result. 

We now extend Theorem 7.9 to show the rather remarkable fact that any 
property of n-bit strings has a 3-query PCPP system. (The proof length, how- 
ever, is enormous.) 


Theorem 7.16. Let $\mathcal{C} \subseteq \{0,1\}^n$ be any class of strings. Then there is a 3-query,
length-$2^{2^n}$ PCPP system for $\mathcal{C}$ (with rejection rate .001).


Proof. Let $N = 2^n$ and fix an arbitrary bijection $\iota : \{0,1\}^n \to [N]$. The tester
will interpret the string $w \in \{0,1\}^n$ to be tested as an index $\iota(w) \in [N]$ and
will interpret the $2^N$-length proof $\Pi$ as a function $\Pi : \{0,1\}^N \to \{0,1\}$. The
idea is for the tester to require that $\Pi$ be the dictator function corresponding to
index $\iota(w)$; i.e., $\chi_{\iota(w)} : \{0,1\}^N \to \{0,1\}$.

Now under the identification $\iota$, we can think of the string property $\mathcal{C}$ as a
subclass of all $N$-bit dictators, namely

$$\mathcal{C}' = \{\chi_{\iota(w')} : \{0,1\}^N \to \{0,1\} \mid w' \in \mathcal{C}\}.$$

In particular, $\mathcal{C}'$ is a property of $N$-bit functions. We can now state the twofold
goal of the tester:

(1) check that $\Pi \in \mathcal{C}'$;
(2) given that $\Pi$ is indeed some dictator $\chi_{\iota(w')} : \{0,1\}^N \to \{0,1\}$ with
$w' \in \mathcal{C}$, check that $w' = w$.


To accomplish the latter the tester would like to check $w'_j = w_j$ for a random
$j \in [n]$. The tester can query any $w_j$ directly but accessing $w'_j$ requires a little
thought. The trick is to prepare the string

$$X^{(j)} \in \{0,1\}^N \quad \text{defined by} \quad X^{(j)}_{\iota(y)} = y_j,$$

and then to locally correct $\Pi$ on $X^{(j)}$ (using Proposition 1.31).
Thus the tester is defined as follows: 


(1) With probability 1/2, locally test the function property $\mathcal{C}'$ using
Theorem 7.9.

(2) With probability 1/2, pick $j \sim [n]$ uniformly at random; locally correct
$\Pi$ on the string $X^{(j)}$ and accept if the outcome equals $w_j$.


Note that the tester makes 3 queries in both of the subtests. 

Verifying “completeness” of this PCPP system is easy: if $w \in \mathcal{C}$ and $\Pi$ is
indeed the (truth table of) $\chi_{\iota(w)} : \{0,1\}^N \to \{0,1\}$ then the test will accept with
probability 1. It remains to verify the “soundness” condition. Fix $w \in \{0,1\}^n$,
$\Pi : \{0,1\}^N \to \{0,1\}$, and $0 \leq \epsilon \leq 1$ and suppose that the tester accepts $(w, \Pi)$
with probability at least $1 - \lambda\epsilon$, where $\lambda = .001$. Our goal is to show that $w$ is
$\epsilon$-close to some string $w' \in \mathcal{C}$.

Since the overall test accepts with probability at least $1 - \lambda\epsilon$, subtest (1)
above accepts with probability at least $1 - 2\lambda\epsilon$. Thus by Theorem 7.9, $\Pi$ must
be $200\lambda\epsilon$-close to some dictator $\chi_{\iota(w')}$ with $w' \in \mathcal{C}$. Since dictators are parity
functions, Proposition 1.31 tells us that

$$\forall j, \quad \Pr[\text{locally correcting } \Pi \text{ on } X^{(j)} \text{ produces } \chi_{\iota(w')}(X^{(j)}) = w'_j] \geq 1 - 400\lambda\epsilon > 1/2, \tag{7.1}$$

where we used $400\lambda\epsilon \leq 400\lambda < 1/2$ by the choice $\lambda = .001$.


On the other hand, since the overall test accepts with probability at least
$1 - \lambda\epsilon$, subtest (2) above rejects with probability at most $2\lambda\epsilon$. This means

$$\mathop{\mathrm{avg}}_{j \sim [n]} \Pr[\text{locally correcting } \Pi \text{ on } X^{(j)} \text{ doesn't produce } w_j] \leq 2\lambda\epsilon.$$

By Markov’s inequality we deduce that except for at most a $4\lambda\epsilon$ fraction of
coordinates $j \in [n]$ we have

$$\Pr[\text{locally correcting } \Pi \text{ on } X^{(j)} \text{ doesn't produce } w_j] < 1/2.$$

Combining this information with (7.1) we deduce that $w_j = w'_j$ except for at
most a $4\lambda\epsilon \leq \epsilon$ fraction of coordinates $j \in [n]$. Since $w' \in \mathcal{C}$ we conclude that
$\mathrm{dist}(w, \mathcal{C}) \leq \epsilon$, as desired.
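
For very small $n$ the objects in the proof of Theorem 7.16 can be written out explicitly. The Python sketch below builds a bijection $\iota$ (0-indexed here, onto $\{0, \dots, N-1\}$), the probe strings $X^{(j)}$, and the honest dictator proof, and implements only subtest (2). All names are our own, and the two-query form of local correction is our assumed reading of Proposition 1.31; subtest (1), the tester from Theorem 7.9, is omitted.

```python
import itertools
import random

def make_encoding(n):
    """Return the strings of {0,1}^n and a bijection iota onto {0, ..., 2^n - 1}."""
    strings = list(itertools.product((0, 1), repeat=n))
    iota = {y: index for index, y in enumerate(strings)}
    return strings, iota

def probe_string(strings, j):
    """X^(j) in {0,1}^N, whose coordinate iota(y) equals y_j."""
    return [y[j] for y in strings]

def dictator_proof(iota, w_prime):
    """The honest proof for w': the dictator X -> X[iota(w')] on N-bit inputs."""
    index = iota[w_prime]
    return lambda X: X[index]

def locally_correct(proof, X, rng):
    """Assumed two-query local correction: proof(Y) + proof(X + Y) over F_2."""
    Y = [rng.randrange(2) for _ in X]
    X_plus_Y = [(a + b) % 2 for a, b in zip(X, Y)]
    return (proof(Y) + proof(X_plus_Y)) % 2

def consistency_subtest(w, proof, strings, rng):
    """Subtest (2): pick a random j and compare the corrected value with w_j."""
    j = rng.randrange(len(w))
    return locally_correct(proof, probe_string(strings, j), rng) == w[j]
```

With $n = 3$, the honest proof for `w = (1, 0, 1)` satisfies `proof(probe_string(strings, j)) == w[j]` for every `j`, so the subtest always accepts. If instead the proof encodes a string $w'$ differing from $w$ in one of the three coordinates, the subtest rejects exactly when that coordinate is chosen, i.e., with probability 1/3.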


You may feel that the doubly-exponential proof length $2^{2^n}$ in this theorem is
quite bad, but bear in mind there are $2^{2^n}$ different properties $\mathcal{C}$. Actually, giving
a PCPP system for every property is a bit overzealous since most properties are
not interesting or natural. A more reasonable goal would be to give efficient
PCPP systems for all “explicit” properties. A good way to formalize this is
to consider properties decidable by polynomial-size circuits. Here we use the
definition of general (De Morgan) circuits from Exercise 4.13. Given an
$n$-variable circuit $C$ we consider the set of strings which it “accepts” to be a
property,

$$\mathcal{C} = \{w \in \{0,1\}^n : C(w) = 1\}. \tag{7.2}$$

For properties computed by modest-sized circuits $C$ we may hope for PCPP
systems with proof length much less than $2^{2^n}$. We saw such a case in
Example 7.15.

Another advantage of considering “explicit” properties is that we can define 
a notion of constructing a PCPP system, “given” a property. A theorem of the 
form “for each explicit property $\mathcal{C}$ there exists an efficient PCPP system...”
may not be useful, practically speaking, if its proof is nonconstructive. We can 
formalize the issue as follows: 


Definition 7.17. A PCPP reduction is an algorithm which takes as input a
circuit $C$ and outputs the description of a PCPP system for the string
property $\mathcal{C}$ decided by $C$ as in (7.2), where $n$ is the number of inputs to $C$. If the
output PCPP system always makes $r$ queries, has proof length $\ell(n, \mathrm{size}(C))$
(for some function $\ell$), and has rejection rate $\lambda > 0$, we say that the PCPP
reduction has the same parameters. Finally, the PCPP reduction should run in
time $\mathrm{poly}(\mathrm{size}(C), \ell)$.


(We haven’t precisely specified what it means to output the description of a 
PCPP system; this will be explained more carefully in Section 7.3. In brief it 
means to list — for each possible outcome of the tester’s randomness — which 
bits are queried and what predicate of them is used to decide acceptance.) 

Looking back at the results on testing subclasses of dictatorship (Theo- 
rem 7.9) and PCPPs for any property (Theorem 7.16) we can see they have the 
desired sort of “constructive” proofs. In Theorem 7.9 the local tester’s
description depends in a very simple way on the input $1_S$. As for Theorem 7.16, it
suffices to note that given an $n$-input circuit $C$ we can write down its truth
table (and hence the property it decides) in time $\mathrm{poly}(\mathrm{size}(C)) \cdot 2^n$, whereas the
allowed running time is at least $\mathrm{poly}(\mathrm{size}(C), 2^{2^n})$. Hence we may state:


Theorem 7.18. There exists a 3-query PCPP reduction with proof length $2^{2^n}$
(and rejection rate .001).


In Exercise 7.18 you are asked to improve this result as follows: 


Theorem 7.19. There exists a 3-query PCPP reduction with proof length
$2^{\mathrm{poly}(\mathrm{size}(C))}$ (and positive rejection rate).


(The fact that we again have just 3 queries is explained by Exercise 7.12; 
there is a generic reduction from any constant number of queries down 
to 3.) 

Indeed, there is a much more dramatic improvement: 


The PCPP Theorem. There exists a 3-query PCPP reduction with proof length
poly(size(C)) (and positive rejection rate). 


This is (a slightly strengthened version of) the famous “PCP Theo- 
rem” (Feige et al., 1996; Arora and Safra, 1998; Arora et al., 1998) from 
the field of computational complexity, which is discussed later in this chapter. 
Though the PCPP Theorem is far stronger than Theorem 7.18, the latter is 
not unnecessary; it’s actually an ingredient in Dinur’s proof of the PCP Theo- 
rem (Dinur, 2007), being applied only to circuits of “constant” size. The current 
state of the art for PCPP length (Dinur, 2007; Ben-Sasson and Sudan, 2008) is 
highly efficient: 


Theorem 7.20. There exists a 3-query PCPP reduction with proof length 
size(C) - polylog(size(C)) (and positive rejection rate). 


7.3. CSPs and Computational Complexity 


This section is about the computational complexity of constraint satisfaction 
problems (CSPs), a fertile area of application for analysis of Boolean functions. 
To study it we need to introduce a fair bit of background material; in fact, this 
section will mainly consist of definitions. 

In brief, a CSP is an algorithmic task in which a large number of “variables” 
must be assigned “labels” so as to satisfy given “local constraints”. We start by 
informally describing some examples: 
Example 7.21.


• In the “Max-3-Sat” problem, given is a CNF formula of width at most 3 over
Boolean variables $x_1, \dots, x_n$. The task is to find a setting of the inputs that
satisfies (i.e., makes True) as many clauses as possible.

• In the “Max-Cut” problem, given is an undirected graph $G = (V, E)$. The
task is to find a “cut” (i.e., a partition of $V$ into two parts) so that as many
edges as possible “cross the cut”.

• In the “Max-E3-Lin” problem, given is a system of linear equations over $\mathbb{F}_2$,
each equation involving exactly 3 variables. The system may in general
be overdetermined; the task is to find a solution which satisfies as many
equations as possible.

• In the “Max-3-Coloring” problem, given is an undirected graph $G = (V, E)$.
The task is to color each vertex either red, green, or blue so as to make as
many edges as possible bichromatic.


Let’s rephrase the last two of these examples so that the descriptions have 
more in common. In Max-E3-Lin we have a set of variables $V$, to be assigned
labels from the domain $\Omega = \mathbb{F}_2$. Each constraint is of the form $v_1 + v_2 +
v_3 = 0$ or $v_1 + v_2 + v_3 = 1$, where $v_1, v_2, v_3 \in V$. In Max-3-Coloring we have
a set of variables (vertices) $V$ to be assigned labels from the domain $\Omega =
\{\text{red}, \text{green}, \text{blue}\}$. Each constraint (edge) is a pair of variables, constrained to
be labeled by unequal colors.

We now make formal definitions which encompass all of the above examples: 


Definition 7.22. A constraint satisfaction problem (CSP) over domain Ω is
defined by a finite set of predicates (“types of constraints”) Ψ, with each
ψ ∈ Ψ being of the form ψ : Ω^r → {0, 1} for some arity r (possibly different
for different predicates). We say that the arity of the CSP is the maximum arity
of its predicates.


Such a CSP is associated with an algorithmic task called “Max-CSP(Ψ)”,
which we will define below. First, though, let us see how the CSPs from
Example 7.21 fit into the above definition.


• Max-3-Sat: Domain Ω = {True, False}; Ψ contains 14 predicates: the 8 logical
  OR functions on 3 literals (variables/negated-variables), the 4 logical OR
  functions on 2 literals, and the 2 logical OR functions on 1 literal.

• Max-Cut: Domain Ω = {−1, 1}; Ψ = {≠}, the “not-equal” predicate
  ≠ : {−1, 1}^2 → {0, 1}.

• Max-E3-Lin: Domain Ω = F_2; Ψ contains two 3-ary predicates,
  (x_1, x_2, x_3) ↦ x_1 + x_2 + x_3 and (x_1, x_2, x_3) ↦ x_1 + x_2 + x_3 + 1.




• Max-3-Coloring: Domain Ω = {red, green, blue}; Ψ contains just the single
  not-equal predicate ≠ : Ω^2 → {0, 1}.


Remark 7.23. Let us add a few words about traditional CSP terminology.
Boolean CSPs refer to the case |Ω| = 2. If ψ : {−1, 1}^r → {0, 1} is a Boolean
predicate we sometimes write “Max-ψ” to refer to the CSP where all constraints
are of the form ψ applied to literals; i.e., Ψ = {ψ(±v_1, ..., ±v_r)}. As an
example, Max-E3-Lin could also be called Max-χ_[3]. The “E3” in the name
Max-E3-Lin refers to the fact that all constraints involve “E”xactly 3 variables.
Thus e.g. Max-3-Lin is the generalization in which 1- and 2-variable equations
are allowed. Conversely, Max-E3-Sat is the special case of Max-3-Sat where
each clause must be of width exactly 3 (a CSP which could also be called
Max-OR_3).


To formally define the algorithmic task Max-CSP(Ψ), we begin by defining
its input:


Definition 7.24. An instance (or input) 𝒫 of Max-CSP(Ψ) over variable set V
is a list (multiset) of constraints. Each constraint C ∈ 𝒫 is a pair C = (S, ψ),
where ψ ∈ Ψ and where the scope S = (v^1, ..., v^r) is a tuple of distinct
variables from V, with r being the arity of ψ. We always assume that each
v ∈ V participates in at least one constraint scope. The size of an instance is
the number of bits required to represent it; writing n = |V| and treating |Ω|,
|Ψ| and the arity of Ψ as constants, the size is between n and O(|𝒫| log n).


Remark 7.25. Let’s look at how the small details of Definition 7.24 affect 
input graphs for Max-Cut. Since an instance is a multiset of constraints, this 
means we allow graphs with parallel edges. Since each scope must consist of 
distinct variables, this means we disallow graphs with self-loops. Finally, since 
each variable must participate in at least one constraint, this means input graphs 
must have no isolated vertices (though they may be disconnected). 


Given an assignment of labels for the variables, we are interested in the number
of constraints that are “satisfied”. The reason we explicitly allow duplicate
constraints in an instance is that we may want some constraints to be more
important than others. In fact it’s more convenient to normalize by looking at
the fraction of satisfied constraints, rather than the number. Equivalently, we
can choose a constraint C ∼ 𝒫 uniformly at random and look at the probability
that it is satisfied. It will actually be quite useful to think of a CSP instance 𝒫 as
a probability distribution on constraints. (Indeed, we could have more generally
defined weighted CSPs in which the constraints are given arbitrary nonnegative
weights summing to 1; however, we don’t want to worry about the issue of
representing, say, irrational weights with finitely many bits.)


Definition 7.26. An assignment (or labeling) for instance 𝒫 of Max-CSP(Ψ)
is just a mapping F : V → Ω. For constraint C = (S, ψ) ∈ 𝒫 we say that F
satisfies C if ψ(F(S)) = 1. Here we use shorthand notation: if S = (v^1, ..., v^r)
then F(S) denotes (F(v^1), ..., F(v^r)). The value of F, denoted Val_𝒫(F), is
the fraction of constraints in 𝒫 that F satisfies:

\[ \mathrm{Val}_{\mathcal{P}}(F) = \mathop{\mathbf{E}}_{(S,\psi) \sim \mathcal{P}}[\psi(F(S))] \in [0, 1]. \tag{7.3} \]

The optimum value of 𝒫 is

\[ \mathrm{Opt}(\mathcal{P}) = \max_{F : V \to \Omega} \{\mathrm{Val}_{\mathcal{P}}(F)\}. \]

If Opt(𝒫) = 1, we say that 𝒫 is satisfiable.

Remark 7.27. In the literature on CSPs there is sometimes an unfortunate
blurring between a variable and its assignment. For example, a Max-E3-Lin
instance may be written as

    x_1 + x_2 + x_3 = 0
    x_1 + x_5 + x_6 = 0
    x_3 + x_4 + x_6 = 1;

then a particular assignment x_1 = 0, x_2 = 1, x_3 = 0, x_4 = 1, x_5 = 1, x_6 = 1
may be given. Now there is confusion: Does x_2 represent the name of a variable
or does it represent 1? Because of this we prefer to display CSP instances with
the name of the assignment F present in the constraints. That is, the above
instance would be described as finding F : {x_1, ..., x_6} → F_2 so as to satisfy
as many as possible of the following:

    F(x_1) + F(x_2) + F(x_3) = 0
    F(x_1) + F(x_5) + F(x_6) = 0
    F(x_3) + F(x_4) + F(x_6) = 1.

Finally, we define the algorithmic task associated with a CSP:


Definition 7.28. The algorithmic task Max-CSP(Ψ) is defined as follows: The
input is an instance 𝒫. The goal is to output an assignment F with as large a
value as possible.

Having defined CSPs, let us make a connection to the notion of a string
testing algorithm from the previous section. The connection is this: CSPs and
string testing algorithms are the same object. Indeed, consider a CSP instance 𝒫
over domain Ω with n variables V. Fix an assignment F : V → Ω; we can also
think of F as a string in Ω^n (under some ordering of V). Now think of a
testing algorithm which chooses a constraint (S, ψ) ∼ 𝒫 at random, “queries”
the string entry F(v) for each v ∈ S, and accepts if and only if the predicate
ψ(F(S)) is satisfied. This is indeed an r-query string testing algorithm, where
r is the arity of the CSP; the probability the tester accepts is precisely Val_𝒫(F).
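
To make the definitions of instance, value, and optimum concrete, here is a minimal Python sketch. The encoding is an assumption made for illustration only: an instance is a list of (scope, predicate) pairs, and the helper names `lin3`, `value`, and `optimum` are hypothetical. The sample instance and assignment are the ones from Remark 7.27. Evaluating `value` is exactly computing the acceptance probability of the string tester described above.

```python
from fractions import Fraction
from itertools import product


def lin3(rhs):
    """3-ary predicate over F_2: accepts when the three labels sum to rhs mod 2."""
    return lambda labels: int(sum(labels) % 2 == rhs)


# The Max-E3-Lin instance from Remark 7.27, as a list of (scope, predicate) pairs.
instance = [
    (("x1", "x2", "x3"), lin3(0)),
    (("x1", "x5", "x6"), lin3(0)),
    (("x3", "x4", "x6"), lin3(1)),
]


def value(instance, F):
    """Fraction of constraints (counted with multiplicity) that F satisfies."""
    satisfied = sum(pred(tuple(F[v] for v in scope)) for scope, pred in instance)
    return Fraction(satisfied, len(instance))


def optimum(instance, domain):
    """Brute-force optimum over all |domain|^n assignments (tiny instances only)."""
    variables = sorted({v for scope, _ in instance for v in scope})
    return max(
        value(instance, dict(zip(variables, labels)))
        for labels in product(domain, repeat=len(variables))
    )


# The particular assignment mentioned in Remark 7.27.
F = {"x1": 0, "x2": 1, "x3": 0, "x4": 1, "x5": 1, "x6": 1}
print(value(instance, F))
print(optimum(instance, (0, 1)))
```

The brute-force `optimum` takes time exponential in the number of variables, so it is only useful for checking small examples by hand.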

Conversely, let T be some randomized testing algorithm for strings in Ω^n.
Assume for simplicity that T’s randomness comes from the uniform distribution
over some sample space U. Now suppose we enumerate all outcomes in U,
and for each we write the tuple of indices S that T queries and the predicate
ψ : Ω^{|S|} → {0, 1} that T uses to make its subsequent accept/reject decision.
Then this list of scope/predicate pairs is precisely an instance of an n-variable
CSP over Ω. The arity of the CSP is equal to the (maximum) number of
queries that T makes and the predicates for the CSP are precisely those used
by the tester in making its accept/reject decisions. Again, the probability that
T accepts a string F ∈ Ω^n is equal to the value of F as an assignment for the
CSP. (Our actual definition of string testers allowed any form of randomness,
including, say, irrational probabilities; thus technically not every string tester
can be viewed as a CSP. However, it does little harm to ignore this technicality.)

In particular, this equivalence between string testers and CSPs lets us prop- 
erly define “outputting the description of a PCPP system” as in Definition 7.17 
of PCPP reductions. 


Example 7.29. The PCPP system for 𝒪 = {w ∈ F_2^n : w_1 + ··· + w_n = 1}
given in Example 7.15 can be thought of as an instance of the Max-3-Lin
CSP over the 2n − 1 variables {w_1, ..., w_n, Π_1, ..., Π_{n−1}}. The BLR linearity
test for functions F_2^n → F_2 can also be thought of as an instance of Max-3-Lin
over 2^n variables (recall that function testers are string testers). In this case we
identify the variable set with F_2^n; if n = 2 then the variables are named (0, 0),
(0, 1), (1, 0), and (1, 1); and, if we write F : F_2^2 → F_2 for the assignment, the
instance is

    F(0, 0) + F(0, 0) + F(0, 0) = 0
    F(0, 0) + F(0, 1) + F(0, 1) = 0
    F(0, 0) + F(1, 0) + F(1, 0) = 0
    F(0, 0) + F(1, 1) + F(1, 1) = 0

    F(0, 1) + F(0, 0) + F(0, 1) = 0
    F(0, 1) + F(0, 1) + F(0, 0) = 0
    F(0, 1) + F(1, 0) + F(1, 1) = 0
    F(0, 1) + F(1, 1) + F(1, 0) = 0

    F(1, 0) + F(0, 0) + F(1, 0) = 0
    F(1, 0) + F(0, 1) + F(1, 1) = 0
    F(1, 0) + F(1, 0) + F(0, 0) = 0
    F(1, 0) + F(1, 1) + F(0, 1) = 0

    F(1, 1) + F(0, 0) + F(1, 1) = 0
    F(1, 1) + F(0, 1) + F(1, 0) = 0
    F(1, 1) + F(1, 0) + F(0, 1) = 0
    F(1, 1) + F(1, 1) + F(0, 0) = 0.

Cf. Remark 7.27; also, note the duplicate constraints.
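
The n = 2 BLR instance of Example 7.29 can be generated mechanically by enumerating the tester's random choices (x, y). The following Python sketch does this under an assumed encoding (constraints as triples of points of F_2^2; all names are illustrative, not from the text). It also groups equations that coincide up to the order of their terms, which makes the duplicate constraints visible.

```python
from collections import Counter
from itertools import product

n = 2
points = list(product((0, 1), repeat=n))  # the variable set, identified with F_2^n


def add(x, y):
    """Coordinate-wise addition in F_2^n."""
    return tuple((a + b) % 2 for a, b in zip(x, y))


# One constraint F(x) + F(y) + F(x + y) = 0 for each outcome (x, y) of the
# tester's randomness; the list is a multiset, so repeats are kept.
constraints = [(x, y, add(x, y)) for x in points for y in points]


def value(F):
    """Fraction of the BLR constraints satisfied by the assignment F."""
    ok = sum((F[x] + F[y] + F[z]) % 2 == 0 for x, y, z in constraints)
    return ok / len(constraints)


# Equations that agree up to reordering of their three terms are duplicates.
distinct = Counter(tuple(sorted(c)) for c in constraints)

# The four linear functions x -> <g, x> on F_2^2, each as an assignment.
linear = [
    {x: sum(a * b for a, b in zip(g, x)) % 2 for x in points} for g in points
]

print(len(constraints), len(distinct))
print([value(F) for F in linear])
```

Each linear function satisfies every constraint, which matches the fact that the BLR Test accepts linear functions with probability 1.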
We end this section by discussing the computational complexity of finding
high-value assignments for a given CSP; equivalently, finding strings that
make a given string tester accept with high probability. Consider, for example,
the task of Max-Cut on n-vertex graphs. Of course, given a Max-Cut instance
one can always find the optimal solution in time roughly 2^n, just by trying all
possible cuts. Unfortunately, this is not very efficient, even for slightly large
values of n. In computational complexity theory, an algorithm is generally
deemed “efficient” if it runs in time poly(n). For some subfamilies of graphs
there are poly(n)-time algorithms for finding the maximum cut, e.g., bipartite
graphs (Exercise 7.14) or planar graphs. However, it seems very unlikely that
there is a poly(n)-time algorithm that is guaranteed to find an optimal Max-Cut
assignment given any input graph. This statement is formalized by a basic
theorem from the field of computational complexity:


Theorem 7.30. The task of finding the maximum cut in a given input graph is 
“NP-hard”. 
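
For contrast with the hardness statement, the exhaustive search mentioned above (time roughly 2^n) is easy to write down. The sketch below is illustrative only; the function name and the edge-list encoding are assumptions. Following Remark 7.25, the edge list is a multiset, so parallel edges are allowed and count separately.

```python
from itertools import product


def max_cut_brute_force(vertices, edges):
    """Try all 2^n labelings F : V -> {-1, 1}; return (edges cut, labeling)."""
    best_cut, best_labeling = -1, None
    for labels in product((-1, 1), repeat=len(vertices)):
        F = dict(zip(vertices, labels))
        # An edge (u, v) crosses the cut exactly when its endpoints get unequal labels.
        cut = sum(F[u] != F[v] for u, v in edges)
        if cut > best_cut:
            best_cut, best_labeling = cut, F
    return best_cut, best_labeling


triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(max_cut_brute_force(["a", "b", "c"], triangle)[0])
```

Dividing the returned count by the number of edges gives the optimum value of the instance in the sense of Definition 7.26.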


We will not formally define NP-hardness in this book (though see Exercise 7.13
for some more explanation). Roughly speaking it means “at least as
hard as the Circuit-Sat problem”, where “Circuit-Sat” is the following task:
Given an n-variable Boolean circuit C, decide whether or not C is satisfiable
(i.e., there exists w ∈ {0, 1}^n such that C(w) = 1). It is widely believed that
Circuit-Sat does not have a polynomial-time algorithm (this is the “P ≠ NP”
conjecture). In fact it is also believed that Circuit-Sat does not have a 2^{o(n)}-time
algorithm.

For essentially all CSPs, including Max-E3-Sat, Max-E3-Lin, and Max-3- 
Coloring, finding an optimal solution is NP-hard. This motivates considering a 
relaxed goal: 


Definition 7.31. Let 0 ≤ α ≤ β ≤ 1. We say that algorithm A is an (α, β)-approximation
algorithm for Max-CSP(Ψ) (pronounced “α out of β approximation”)
if it has the following guarantee: on any instance with optimum value
at least β, algorithm A outputs an assignment of value at least α. In case A is
a randomized algorithm, we only require that its output has value at least α in
expectation.


A mnemonic here is that when the βest assignment has value β, the αlgorithm
gets value α.


Example 7.32. Consider the following algorithm for Max-E3-Lin: Given
an instance, output either the assignment F ≡ 0 or the assignment F ≡ 1,
whichever has higher value. Since either 0 or 1 occurs on at least half of the
instance’s “right-hand sides”, the output assignment will always have value at
least 1/2. Thus this is an efficient (1/2, β)-approximation algorithm for any β. In
the case β = 1 one can do better: performing Gaussian elimination is an efficient
(1, 1)-approximation algorithm for Max-E3-Lin (or indeed Max-r-Lin for
any r).

As a far more sophisticated example, Goemans and Williamson (Goemans
and Williamson, 1995) showed that there is an efficient (randomized) algorithm
which (.878β, β)-approximates Max-Cut for every β.
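
The first algorithm of Example 7.32 is simple enough to sketch in a few lines of Python. The encoding below (a list of (scope, rhs) pairs) and the function name are assumptions for illustration. The sketch relies on the fact that each equation has exactly 3 variables, so F ≡ 1 makes every left-hand side equal to 1 + 1 + 1 = 1 over F_2.

```python
from fractions import Fraction


def best_constant_assignment(equations):
    """equations: list of (scope, rhs), scope = 3 distinct variables, rhs in {0, 1}.

    Returns (c, value): the better constant assignment F = c and its value.
    """
    ones = sum(rhs for _, rhs in equations)
    zeros = len(equations) - ones
    # F = 0 gives left-hand side 0 and F = 1 gives 1 + 1 + 1 = 1 (mod 2), so the
    # constant c satisfies exactly the equations whose right-hand side is c.
    c = 1 if ones > zeros else 0
    return c, Fraction(max(ones, zeros), len(equations))


# The instance from Remark 7.27.
equations = [
    (("x1", "x2", "x3"), 0),
    (("x1", "x5", "x6"), 0),
    (("x3", "x4", "x6"), 1),
]
print(best_constant_assignment(equations))
```

The returned value is always at least 1/2, regardless of the instance's optimum, which is why the guarantee holds for every β.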


Not only is finding the optimal solution of a Max-E3-Sat instance NP-hard, 
it’s even NP-hard on satisfiable instances. In other words: 


Theorem 7.33. (1, 1)-approximating Max-E3-Sat is NP-hard. The same is true
of Max-3-Coloring.


On the other hand, it’s easy to (1, 1)-approximate Max-3-Lin (Example 7.32) 
or Max-Cut (Exercise 7.14). Nevertheless, the “textbook” NP-hardness results 
for these problems imply the following: 


Theorem 7.34. (β, β)-approximating Max-E3-Lin is NP-hard for any fixed
β ∈ (1/2, 1). The same is true of Max-Cut.


In some ways, saying that (1, 1)-distinguishing Max-E3-Sat is NP-hard is not
necessarily that disheartening. For example, if (1 − δ, 1)-approximating Max-E3-Sat
were possible in polynomial time for every δ > 0, you might consider
that “good enough”. Unfortunately, such a state of affairs is very likely ruled
out:


Theorem 7.35. There exists a positive universal constant δ_0 > 0 such that
(1 − δ_0, 1)-approximating Max-E3-Sat is NP-hard.


In fact, Theorem 7.35 is equivalent to the “PCP Theorem” mentioned in 
Section 7.2. It follows straightforwardly from the PCPP Theorem, as we now 
sketch: 


Proof sketch. Let δ_0 be the rejection rate in the PCPP Theorem. We want
to show that (1 − δ_0, 1)-approximating Max-E3-Sat is at least as hard as the
Circuit-Sat problem. Equivalently, we want to show that if there is an efficient
algorithm A for (1 − δ_0, 1)-approximating Max-E3-Sat then there is an efficient
algorithm B for Circuit-Sat. So suppose A exists and let C be a Boolean
circuit given as input to B. Algorithm B first applies to C the PCPP reduction
given by the PCPP Theorem. The output is some arity-3 CSP instance 𝒫 over
variables w_1, ..., w_n, Π_1, ..., Π_ℓ, where ℓ ≤ poly(size(C)). By Exercise 7.12
we may assume that 𝒫 is an instance of Max-E3-Sat. From the definition of
a PCPP system, it is easy to check (Exercise 7.19) the following: If C is
satisfiable then Opt(𝒫) = 1; and, if C is not satisfiable then Opt(𝒫) < 1 − δ_0.
Algorithm B now runs the supposed (1 − δ_0, 1)-approximation algorithm A
on 𝒫 and outputs “C is satisfiable” if and only if A finds an assignment of
value at least 1 − δ_0.

In Theorem 7.35 we saw that it is NP-hard to (1 − δ_0, 1)-approximate Max-E3-Sat
for some positive but inexplicit constant δ_0. You might wonder how large
δ_0 can be. The natural limit here is 1/8 because there is a very simple algorithm
that satisfies a 7/8-fraction of the constraints in any Max-E3-Sat instance:


Proposition 7.36. Consider the Max-E3-Sat algorithm that outputs a uniformly
random assignment F. This is a (7/8, β)-approximation for any β.


Proof. In instance 𝒫, each constraint is a logical OR of exactly 3 literals and
will therefore be satisfied by F with probability exactly 7/8. Hence in expectation
the algorithm will satisfy a 7/8-fraction of the constraints.


(It’s also easy to “derandomize” this algorithm, giving a deterministic guarantee
of at least 7/8 of the constraints; see Exercise 7.21.)
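
One standard way to derandomize a random-assignment algorithm is the method of conditional expectations. The Python sketch below applies it to Max-E3-Sat; it is one possible approach under stated assumptions, not necessarily the route intended by Exercise 7.21. The clause encoding (tuples of (variable, wanted_value) literals over distinct variables) and the function names are hypothetical.

```python
from fractions import Fraction


def expected_satisfied(clauses, partial):
    """Expected number of satisfied clauses when the variables in `partial`
    are fixed and every other variable is set uniformly at random."""
    total = Fraction(0)
    for clause in clauses:
        unset = 0
        already_true = False
        for var, wanted in clause:
            if var not in partial:
                unset += 1
            elif partial[var] == wanted:
                already_true = True
        # A clause with no true literal yet fails only if all of its
        # `unset` remaining literals come out false: probability 2^(-unset).
        total += 1 if already_true else 1 - Fraction(1, 2 ** unset)
    return total


def derandomized_assignment(clauses):
    """Fix variables one at a time, never letting the conditional expectation drop."""
    variables = sorted({var for clause in clauses for var, _ in clause})
    partial = {}
    for var in variables:
        partial[var] = max(
            (False, True),
            key=lambda b: expected_satisfied(clauses, {**partial, var: b}),
        )
    return partial


clauses = [
    (("a", True), ("b", True), ("c", True)),
    (("a", False), ("d", True), ("e", False)),
    (("b", False), ("c", False), ("e", True)),
]
F = derandomized_assignment(clauses)
print(expected_satisfied(clauses, F), "of", len(clauses))
```

Since the conditional expectation starts at 7/8 of the clauses and one of the two choices for each variable never decreases it, the final assignment satisfies at least a 7/8-fraction.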

This algorithm is of course completely brainless: it doesn’t even “look
at” the instance it is trying to approximately solve. But rather remarkably, it
achieves the best possible approximation guarantee among all efficient algorithms
(assuming P ≠ NP). This is a consequence of the following 1997 theorem
of Håstad (Håstad, 2001b), improving significantly on Theorem 7.35:


Håstad’s 3-Sat Hardness. For any constant δ > 0, it is NP-hard to (7/8 + δ, 1)-approximate
Max-E3-Sat.


Håstad gave similarly optimal hardness-of-approximation results for several 
other problems, including Max-E3-Lin: 


Håstad’s 3-Lin Hardness. For any constant δ > 0, it is NP-hard to
(1/2 + δ, 1 − δ)-approximate Max-E3-Lin.


In this hardness theorem, both the “α” and “β” parameters are optimal;
as we saw in Example 7.32 one can efficiently (1/2, β)-approximate and also
(1, 1)-approximate Max-E3-Lin.

The goal of this section is to sketch the proof of the above theorems, mainly
Håstad’s 3-Lin Hardness Theorem. Let’s begin by considering the 3-Sat hardness
result. If our goal is to increase the inexplicit constant δ_0 in Theorem 7.35,
it makes sense to look at how the constant arises. From the proof of Theorem 7.35
we see that it’s just the rejection rate in the PCPP Theorem. We didn’t
prove that theorem, but let’s consider its length-2^{2^n} analogue, Theorem 7.18.
The key ingredient in the proof of Theorem 7.18 is the dictator test. Indeed, if
we strip away the few local correcting and consistency checks, we see that the
dictator test component controls both the rejection rate and the type of predicates
output by the PCPP reduction. This observation suggests that to get a
strong hardness-of-approximation result for, say, Max-E3-Lin, we should seek
a local tester for dictatorship which (a) has a large rejection rate, and (b) makes
its accept/reject decision using 3-variable linear equation predicates.

This approach (which of course needs to be integrated with efficient
“PCPP technology”) was suggested in a 1995 paper of Bellare, Goldreich,
and Sudan (Bellare et al., 1995). Using it, they managed to prove NP-hardness
of (1 − δ_0, 1)-approximating Max-E3-Sat with the explicit constant δ_0 = .026.
Håstad’s key conceptual contribution (originally from (Håstad, 1996)) was
showing that given known PCPP technology, it suffices to construct a certain
kind of relaxed dictator test. Roughly speaking, dictators should still be
accepted with probability 1 (or close to 1), but only functions which are “very
unlike” dictators need to be rejected with substantial probability. Since this is
a weaker requirement than in the standard definition of a local tester, we can
potentially achieve a much higher rejection rate, and hence a much stronger
hardness-of-approximation result.

For these purposes, the most useful formalization of being “very unlike 
a dictator” turns out to be “having no notable coordinates” in the sense of 
Definition 6.9. We make the following definition which is appropriate for 
Boolean CSPs. 


Definition 7.37. Let Ψ be a finite set of predicates over the domain Ω =
{−1, 1}. Let 0 < α < β ≤ 1 and let λ : [0, 1] → [0, 1] satisfy λ(ε) → 0 as
ε → 0. Suppose that for each n ∈ N^+ there is a local tester for functions
f : {−1, 1}^n → {−1, 1} with the following properties:


• If f is a dictator then the test accepts with probability at least β.

• If f has no (ε, ε)-notable coordinates (i.e., Inf_i^{(1−ε)}[f] ≤ ε for all i ∈ [n])
  then the test accepts with probability at most α + λ(ε).

• The tester’s accept/reject decision uses predicates from Ψ; i.e., the tester can
  be viewed as an instance of Max-CSP(Ψ).

Then, abusing terminology, we call this family of testers an (α, β)-Dictator-vs.-No-Notables
test using predicate set Ψ.


Remark 7.38. For very minor technical reasons, the above definition should
actually be slightly amended. In this section we freely ignore the amendments,
but for the sake of correctness we state them here. One is a strengthening, one
is a weakening.


• The second condition should be required even for functions f : {−1, 1}^n →
  [−1, 1]; what this means is explained in Exercise 7.22.

• When the tester makes accept/reject decisions by applying ψ ∈ Ψ to query
  results f(x^{(1)}), ..., f(x^{(r)}), it is allowed that the query strings are not all
  distinct. (See Exercise 7.31.)


Remark 7.39. It’s essential in this definition that the “error term” λ(ε) = o_ε(1)
be independent of n. On the other hand, we otherwise care very little about
the rate at which it tends to 0; this is why we didn’t mind using the same
parameter ε in the “(ε, ε)-notable” hypothesis.


Just as the dictator test was the key component in our PCPP reduction 
(Theorem 7.18), Dictator-vs.-No-Notables tests are the key to obtaining strong 
hardness-of-approximation results. The following result (essentially proved in 
Khot et al. (Khot et al., 2007)) lets you obtain hardness results from Dictator- 
vs.-No-Notables tests in a black-box way: 


Theorem 7.40. Fix a CSP over domain Ω = {−1, 1} with predicate set Ψ.
Suppose there exists an (α, β)-Dictator-vs.-No-Notables test using predicate
set Ψ. Then for all δ > 0, it is “UG-hard” to (α + δ, β − δ)-approximate
Max-CSP(Ψ).


In other words, the distinguishing parameters of a Dictator-vs.-No-Notables
test automatically translate to the distinguishing parameters of a hardness result
(up to an arbitrarily small δ).

The advantage of Theorem 7.40 is that it reduces a problem about computational
complexity to a purely Fourier-analytic problem, and a constructive
one at that. The theorem has two disadvantages, however. The first is that
instead of NP-hardness (the gold standard in complexity theory) it merely
gives “UG-hardness”, which roughly means “at least as hard as the Unique-Games
problem”. We leave the definition of the Unique-Games problem to
Exercise 7.27, but suffice it to say it’s not as universally believed to be hard
as Circuit-Sat is. The second disadvantage of Theorem 7.40 is that it only has
β − δ rather than β. This can be a little disappointing, especially when you
are interested in hardness for satisfiable instances (β = 1), as in Håstad’s 3-Sat
Hardness. In his work, Håstad showed that both disadvantages can be erased
provided you construct something similar to, but more complicated than, an
(α, β)-Dictator-vs.-No-Notables test. This is how the Håstad 3-Sat and 3-Lin
Hardness Theorems are proved. Describing this extra complication is beyond
the scope of this book; therefore we content ourselves with the following
theorems:


Theorem 7.41. For any 0 < δ < 1/8, there exists a (7/8 + δ, 1)-Dictator-vs.-No-Notables
test which uses logical OR functions on 3 literals as its predicates.


Theorem 7.42. For any 0 < δ < 1/2, there exists a (1/2, 1 − δ)-Dictator-vs.-No-Notables
test using 3-variable F_2-linear equations as its predicates.


Theorem 7.42 will be proved below, while the proof of Theorem 7.41 is
left for Exercise 7.29. By applying Theorem 7.40 we immediately deduce the
following weakened versions of Håstad’s Hardness Theorems:


Corollary 7.43. For any δ > 0, it is UG-hard to (7/8 + δ, 1 − δ)-approximate
Max-E3-Sat.


Corollary 7.44. For any δ > 0, it is UG-hard to (1/2 + δ, 1 − δ)-approximate
Max-E3-Lin.


Remark 7.45. For Max-E3-Lin, we don’t mind the fact that Theorem 7.40
has β − δ instead of β because our Dictator-vs.-No-Notables test only accepts
dictators with probability 1 − δ anyway. (Note that the 1 − δ in Theorem 7.42
cannot be improved to 1; see Exercise 7.7.)


To prove a result like Theorem 7.42 there are two components: the design
of the test, and its analysis. We begin with the design. Since we are looking
for a test using 3-variable linear equation predicates, the BLR Test naturally
suggests itself; indeed, all of its checks are of the form f(x) + f(y) + f(z) = 0.
It also accepts dictators with probability 1. Unfortunately it’s not true that it
accepts functions with no notable coordinates with probability close to 1/2.
There are two problems: the constant 0 function and “large” parity functions
are both accepted with probability 1, despite having no notable coordinates.
The constant 0 function is easy to deal with: we can replace the BLR Test by
the “Odd BLR Test”.

Odd BLR Test. Given query access to f : F_2^n → F_2:


• Choose x ∼ F_2^n and y ∼ F_2^n independently.
• Choose b ∼ F_2 uniformly at random and set z = x + y + (b, b, ..., b) ∈ F_2^n.
• Accept if f(x) + f(y) + f(z) = b.


Note that this test uses both kinds of 3-variable linear equations as its
predicates. For the test’s analysis, we as usual switch to ±1 notation and think
of testing f(x) f(y) f(z) = b. It is easy to show the following (see the proof of
Theorem 7.42, or Exercise 7.15 for a generalization):


Proposition 7.46. The Odd BLR Test accepts f : {−1, 1}^n → {−1, 1} with
probability

\[ \tfrac12 + \tfrac12 \sum_{|S| \text{ odd}} \hat{f}(S)^3 \;\le\; \tfrac12 + \tfrac12 \max_{|S| \text{ odd}} \{\hat{f}(S)\}. \]

This twist rules out the constant 1 function; it passes the Odd BLR Test with
probability 1/2. It remains to deal with large parity functions. Håstad’s innovation
here was to add a small amount of noise to the Odd BLR Test. Specifically,
given a small δ > 0 we replace z in the above test with z′ ∼ N_{1−δ}(z); i.e.,
we flip each of its bits with probability δ/2. If f is a dictator, then there is
only a δ/2 chance this will affect the test. On the other hand, if f is a parity
of large cardinality, the cumulative effect of the noise will destroy its chance
of passing the linearity test. Note that parities of small odd cardinality will
also pass the test with probability close to 1; however, we don’t need to worry
about them since they have notable coordinates. We can now present Håstad’s
Dictator-vs.-No-Notables test for Max-E3-Lin.


Proof of Theorem 7.42. Given a parameter 0 < δ < 1, define the following
test, which uses Max-E3-Lin predicates:


Håstad_δ Test. Given query access to f : {−1, 1}^n → {−1, 1}:


• Choose x, y ∼ {−1, 1}^n uniformly and independently.

• Choose bit b ∼ {−1, 1} uniformly and set z = b · (x ∘ y) ∈ {−1, 1}^n (where
  ∘ denotes entry-wise multiplication).

• Choose z′ ∼ N_{1−δ}(z).

• Accept if f(x) f(y) f(z′) = b.

We will show that this is a (1/2, 1 − δ/2)-Dictator-vs.-No-Notables test. First,
let us analyze the test assuming b = 1.

\begin{align*}
\Pr[\text{Håstad}_\delta \text{ Test accepts } f \mid b = 1]
  &= \mathbf{E}[\tfrac12 + \tfrac12 f(x) f(y) f(z')] \\
  &= \tfrac12 + \tfrac12 \mathbf{E}[f(x) \cdot f(y) \cdot \mathrm{T}_{1-\delta} f(x \circ y)] \\
  &= \tfrac12 + \tfrac12 \mathop{\mathbf{E}}_{x}[f(x) \cdot (f * \mathrm{T}_{1-\delta} f)(x)] \\
  &= \tfrac12 + \tfrac12 \sum_{S \subseteq [n]} \hat{f}(S) \cdot \widehat{f * \mathrm{T}_{1-\delta} f}(S) \\
  &= \tfrac12 + \tfrac12 \sum_{S \subseteq [n]} (1-\delta)^{|S|} \hat{f}(S)^3.
\end{align*}

On the other hand, when b = −1 we take the expectation of
1/2 − 1/2 f(x) f(y) f(z′) and note that z′ is distributed as N_{−(1−δ)}(x ∘ y). Thus

\[ \Pr[\text{Håstad}_\delta \text{ Test accepts } f \mid b = -1] = \tfrac12 - \tfrac12 \sum_{S \subseteq [n]} (-1)^{|S|} (1-\delta)^{|S|} \hat{f}(S)^3. \]

Averaging the above two results we deduce

\[ \Pr[\text{Håstad}_\delta \text{ Test accepts } f] = \tfrac12 + \tfrac12 \sum_{|S| \text{ odd}} (1-\delta)^{|S|} \hat{f}(S)^3. \tag{7.4} \]

(Incidentally, by taking δ = 0 here we obtain the proof of Proposition 7.46.)
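
Formula (7.4) can be sanity-checked numerically for small n by enumerating every outcome of the test's randomness and comparing with the Fourier-side expression. The Python sketch below does this with exact rational arithmetic. The function names are illustrative assumptions, and the noise step is modeled as flipping each bit of z independently with probability δ/2, as described above.

```python
from fractions import Fraction
from itertools import product
from math import prod


def fourier_coefficient(f, S, n):
    """Fourier coefficient of f on S: E_x[f(x) * prod_{i in S} x_i], x uniform."""
    total = sum(f(x) * prod(x[i] for i in S) for x in product((-1, 1), repeat=n))
    return Fraction(total, 2 ** n)


def accept_probability_formula(f, n, delta):
    """Right-hand side of (7.4): 1/2 + 1/2 * sum over odd |S|."""
    total = Fraction(0)
    for mask in product((0, 1), repeat=n):
        S = [i for i in range(n) if mask[i]]
        if len(S) % 2 == 1:
            total += (1 - delta) ** len(S) * fourier_coefficient(f, S, n) ** 3
    return Fraction(1, 2) + total / 2


def accept_probability_brute_force(f, n, delta):
    """Exact acceptance probability, summing over x, y, b and the noise pattern."""
    cube = list(product((-1, 1), repeat=n))
    flip = delta / 2  # each bit of z is flipped independently with this probability
    accepted = Fraction(0)
    for x in cube:
        for y in cube:
            for b in (-1, 1):
                z = tuple(b * xi * yi for xi, yi in zip(x, y))
                for e in cube:  # e_i = -1 means bit i of z gets flipped
                    weight = prod(flip if ei == -1 else 1 - flip for ei in e)
                    z_noisy = tuple(zi * ei for zi, ei in zip(z, e))
                    if f(x) * f(y) * f(z_noisy) == b:
                        accepted += weight
    return accepted / (2 * len(cube) ** 2)


n, delta = 3, Fraction(1, 4)
dictator = lambda x: x[0]
majority = lambda x: 1 if sum(x) > 0 else -1
parity = lambda x: x[0] * x[1] * x[2]
for name, f in [("dictator", dictator), ("majority", majority), ("parity", parity)]:
    print(name, accept_probability_brute_force(f, n, delta),
          accept_probability_formula(f, n, delta))
```

The brute-force loop visits 2^{3n+1} outcomes, so it is only practical for very small n; its purpose here is to cross-check the algebra, not to replace it.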

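From (7.4) we see that if f is a dictator, f = χ_S with |S| = 1, then it is
accepted with probability 1 − δ/2. (It’s also easy to see this directly from the
definition of the test.) To complete the proof that we have a (1/2, 1 − δ/2)-Dictator-vs.-No-Notables
test, we need to bound the probability that f is
accepted given that it has (ε, ε)-small stable influences. More precisely,
assuming

\[ \mathbf{Inf}_i^{(1-\epsilon)}[f] = \sum_{S \ni i} (1-\epsilon)^{|S|-1} \hat{f}(S)^2 \le \epsilon \quad \text{for all } i \in [n], \tag{7.5} \]

we will show that

\[ \Pr[\text{Håstad}_\delta \text{ Test accepts } f] \le \tfrac12 + \tfrac12 \sqrt{\epsilon}, \quad \text{provided } \epsilon \le \delta. \tag{7.6} \]

This is sufficient because we can take λ(ε) in Definition 7.37 to be

\[ \lambda(\epsilon) = \begin{cases} \tfrac12 \sqrt{\epsilon} & \text{for } \epsilon \le \delta, \\ \tfrac12 & \text{for } \epsilon > \delta. \end{cases} \]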



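Now to obtain (7.6), we continue from (7.4):

\begin{align*}
\Pr[\text{Håstad}_\delta \text{ Test accepts } f]
  &\le \tfrac12 + \tfrac12 \max_{|S| \text{ odd}} \{(1-\delta)^{|S|} \hat{f}(S)\} \cdot \sum_{|S| \text{ odd}} \hat{f}(S)^2 \\
  &\le \tfrac12 + \tfrac12 \max_{|S| \text{ odd}} \{(1-\delta)^{|S|} \hat{f}(S)\} \\
  &\le \tfrac12 + \tfrac12 \sqrt{\max_{|S| \text{ odd}} \{(1-\delta)^{2|S|} \hat{f}(S)^2\}} \\
  &\le \tfrac12 + \tfrac12 \sqrt{\max_{|S| \text{ odd}} \{(1-\delta)^{|S|-1} \hat{f}(S)^2\}} \\
  &\le \tfrac12 + \tfrac12 \sqrt{\max_{i \in [n]} \{\mathbf{Inf}_i^{(1-\delta)}[f]\}},
\end{align*}

where we used that |S| odd implies S nonempty. And the above is indeed at
most 1/2 + 1/2 √ε provided ε ≤ δ, by (7.5).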


7.5. Exercises and Notes


7.1 Suppose there is an r-query local tester for property 𝒞 with rejection rate λ.
    Show that there is a testing algorithm that, given inputs 0 < ε, δ ≤ 1/2,
    makes O(r log(1/δ)/(λε)) (nonadaptive) queries to f and satisfies the following:
    • If f ∈ 𝒞, then the tester accepts with probability 1.
    • If f is ε-far from 𝒞, then the tester accepts with probability at most δ.

7.2 Let ℳ = {(x, y) ∈ {0, 1}^{2n} : x = y}, the property that a string’s first half
    matches its second half. Give a 2-query local tester for ℳ with rejection
    rate 1. (Hint: Locally test that x ⊕ y = (0, 0, ..., 0).)

7.3 Reduce the proof length in Example 7.15 to n − 2.

7.4 Verify the claim from Example 7.12 regarding the 2-query tester for the
    property that a string has all its coordinates equal. (Hint: Use ±1 notation.)

7.5 Let 𝒪 = {w ∈ F_2^n : w has an odd number of 1’s}. Let T be any (n − 1)-query
    string testing algorithm that accepts every w ∈ 𝒪 with probability 1.
    Show that T in fact accepts every string v ∈ F_2^n with probability 1 (even
    though dist(w, 𝒪) = 1/n > 0 for half of all strings w). Thus locally testing 𝒪
    requires n queries.

7.6 Let T be a 2-query testing algorithm for functions {−1, 1}^n → {−1, 1}.
    Suppose that T accepts every dictator with probability 1. Show that it also
    accepts Maj_{n′} with probability 1 for every odd n′ ≤ n. This shows that
    there is no 2-query local tester for dictatorship assuming n > 2. (Hint:
    You’ll need to enumerate all predicates on up to 2 bits.)


7.7 For every α < 1, show that there is no (α, 1)-Dictator-vs.-No-Notables
test using Max-E3-Lin predicates. (Hint: Consider large odd parities.)


7.8 (a) Consider the following 3-query testing algorithm for f : {0, 1}^n →
    {0, 1}. Let x, y ∼ {0, 1}^n be independent and uniformly random,
    define z ∈ {0, 1}^n by z_i = x_i ∧ y_i for each i ∈ [n], and accept if
    f(x) ∧ f(y) = f(z). Let p_k be the probability that this test accepts
    a parity function χ_S : {0, 1}^n → {0, 1} with |S| = k. Show that
    p_0 = p_1 = 1 and that in general p_k ≤ 1/2 + 2^{−|S|}. In fact, you might
    like to show that p_k = 1/2 + (3/4 − 1/4 (−1)^k) 2^{−k}. (Hint: It suffices to
    consider k = n and then compute the correlation of χ_{{1,...,n}} ∧ χ_{{n+1,...,2n}}
    with the bent function IP_{2n}.)

(b) Show how to obtain a 3-query local tester for dictatorship by combining
    the following subtests: (i) the Odd BLR Test; (ii) the test from
    part (a).
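
Before attempting a proof of the closed form in part (a), it may help to confirm it numerically. The Python sketch below is a brute-force check for small k, not a solution to the exercise; the function names and the choice n = 4 are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def parity(S):
    """The {0,1}-valued parity function on the coordinates in S."""
    return lambda x: sum(x[i] for i in S) % 2


def accept_probability(f, n):
    """Exact probability that the AND-based test of part (a) accepts f."""
    cube = list(product((0, 1), repeat=n))
    accepted = 0
    for x in cube:
        for y in cube:
            z = tuple(a & b for a, b in zip(x, y))  # z_i = x_i AND y_i
            accepted += (f(x) & f(y)) == f(z)
    return Fraction(accepted, len(cube) ** 2)


def closed_form(k):
    """The expression 1/2 + (3/4 - 1/4 (-1)^k) 2^{-k} stated in part (a)."""
    return Fraction(1, 2) + (Fraction(3, 4) - Fraction(1, 4) * (-1) ** k) * Fraction(1, 2 ** k)


n = 4
for k in range(n + 1):
    print(k, accept_probability(parity(range(k)), n), closed_form(k))
```

The enumeration uses 4^n pairs (x, y), so it stays fast only for small n; the acceptance probability depends on S only through k = |S|.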


7.9 Obtain the largest explicit rejection rate in Theorem 7.7 that you can. You 
might want to return to the Fourier expressions arising in Theorem 1.30 
and 2.56, as well as Exercise 1.28. Can you improve your bound by doing 
the BLR and NAE Tests with probabilities other than 1/2, 1/2? 


7.10 (a) Say that A is an (α, β)-distinguishing algorithm for Max-CSP(Ψ) if it
     outputs ‘YES’ on instances with value at least β and outputs ‘NO’ on
     instances with value strictly less than α. (On each instance with value
     in [α, β), algorithm A may have either output.) Show that if there is an
     efficient (α, β)-approximation algorithm for Max-CSP(Ψ), then there
     is also an efficient (α, β)-distinguishing algorithm for Max-CSP(Ψ).

(b) Consider Max-CSP(Ψ), where Ψ is a class of predicates that is closed
     under restrictions (to nonconstant functions); e.g., Max-3-Sat. Show
     that if there is an efficient (1, 1)-distinguishing algorithm, then there
     is also an efficient (1, 1)-approximation algorithm. (Hint: Try out all
     labels for the first variable and use the distinguisher.)


7.11 (a) Let φ be a CNF of size s and width w ≥ 3 over variables x_1, ..., x_n.
     Show that there is an “equivalent” CNF φ′ of size at most (w − 2)s
     and width 3 over the variables x_1, ..., x_n plus auxiliary variables
     Π_1, ..., Π_ℓ, with ℓ ≤ (w − 3)s. Here “equivalent” means that for
     every x such that φ(x) = True there exists Π such that φ′(x, Π) =
     True; and, for every x such that φ(x) = False we have φ′(x, Π) = False
     for all Π.

(b) Extend the above so that every clause in φ′ has width exactly 3 (the
     size may increase by O(s)).

7.12 Suppose there exists an r-query PCPP reduction ℛ_1 with rejection rate λ.
     Show that there exists a 3-query PCPP reduction ℛ_2 with rejection rate at
     least λ/(r2^r). The proof length of ℛ_2 should be at most r2^r · m plus the
     proof length of ℛ_1 (where m is the description-size of ℛ_1’s output) and
     the predicates output by the reduction should all be logical ORs applied
     to exactly three literals. (Hint: Exercises 4.1, 7.11.)

7.13 (a) Give a polynomial-time algorithm R that takes as input a general
     Boolean circuit C and outputs a width-3 CNF formula φ with the
     following guarantee: C is satisfiable if and only if φ is satisfiable.
     (Hint: Introduce a variable for each gate in C.)

(b) The previous exercise in fact formally justifies the following statement:
     “(1, 1)-distinguishing Max-3-Sat is NP-hard”. (See Exercise 7.10
     for the definition of (1, 1)-distinguishing.) Argue that,
     indeed, if (1, 1)-distinguishing (or (1, 1)-approximating) Max-3-Sat
     is in polynomial time, then so is Circuit-Sat.

(c) Prove Theorem 7.33. (Hint: Exercise 7.11(b).)

7.14 Describe an efficient $(1, 1)$-approximation algorithm for Max-Cut.

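7.15 (a) Let $H$ be any subspace of $\mathbb{F}_2^n$ and let $\mathcal{H} = \{\chi_\gamma : \mathbb{F}_2^n \to \{-1, 1\} \mid
\gamma \in H^\perp\}$. Give a 3-query local tester for $\mathcal{H}$ with rejection rate 1.
(Hint: Similar to BLR, but with $\langle \varphi_H * f, f * f \rangle$.)

(b) Generalize to the case that $H$ is any affine subspace of $\mathbb{F}_2^n$.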

7.16 Let $A$ be any affine subspace of $\mathbb{F}_2^n$. Construct a 3-query, length-$2^n$ PCPP
system for $A$ with rejection rate a positive universal constant. (Hint: Given
$w \in \mathbb{F}_2^n$, the tester should expect the proof $\Pi \in \{-1, 1\}^{2^n}$ to encode the
truth table of $\chi_w$. Use Exercise 7.15 and also a consistency check based
on local correcting of $\Pi$ at $e_i$, where $i \in [n]$ is uniformly random.)

7.17 (a) Give a 3-query, length-$O(n)$ PCPP system (with rejection rate a positive
universal constant) for the class $\{w \in \mathbb{F}_2^n : \mathrm{IP}_n(w) = 1\}$, where
$\mathrm{IP}_n$ is the inner product mod 2 function ($n$ even).

(b) Do the same for the complete quadratic function $\mathrm{CQ}_n$ from Exercise 1.1.
(Hint: Exercise 4.13.)

7.18 In this exercise you will prove Theorem 7.19.

(a) Let $D \in \mathbb{F}_2^{n \times n}$ be a nonzero matrix and suppose $x, y \sim \mathbb{F}_2^n$ are
uniformly random and independent. Show that $\Pr[y^\top D x \neq 0] \geq \frac14$.

(b) Let $\gamma \in \mathbb{F}_2^n$ and $\Gamma \in \mathbb{F}_2^{n \times n}$. Suppose $x, y \sim \mathbb{F}_2^n$ are uniformly
random and independent. Show that $\Pr[(\gamma^\top x)(\gamma^\top y) = \Gamma \bullet (x y^\top)]$ is 1
if $\Gamma = \gamma\gamma^\top$ and is at most $\frac34$ otherwise. Here we use the notation
$B \bullet C = \sum_{i,j} B_{ij} C_{ij}$ for matrices $B, C \in \mathbb{F}_2^{n \times n}$.

(c) Suppose you are given query access to two functions $\ell : \mathbb{F}_2^n \to \mathbb{F}_2$ and
$q : \mathbb{F}_2^{n \times n} \to \mathbb{F}_2$. Give a 4-query testing algorithm with the following
two properties (for some universal constant $\lambda > 0$): (i) if $\ell = \chi_\gamma$ and
$q = \chi_{\gamma\gamma^\top}$ for some $\gamma \in \mathbb{F}_2^n$, the test accepts with probability 1; (ii) for
all $0 \leq \epsilon \leq 1$, if the test accepts with probability at least $1 - \lambda \cdot \epsilon$,
then there exists some $\gamma \in \mathbb{F}_2^n$ such that $\ell$ is $\epsilon$-close to $\chi_\gamma$ and $q$ is
$\epsilon$-close to $\chi_{\gamma\gamma^\top}$. (Hint: Apply the BLR Test to $\ell$ and $q$, and use part (b)
with local correcting on $q$.)

(d) Let $L$ be a list of homogeneous degree-2 polynomial equations over
variables $w_1, \dots, w_n \in \mathbb{F}_2$. (Each equation is of the form
$\sum_{i,j} c_{ij} w_i w_j = b$ for constants $b, c_{ij} \in \mathbb{F}_2$; we
remark that $w_i^2 = w_i$.) Define the string property $\mathcal{L} = \{w \in \mathbb{F}_2^n :
w \text{ satisfies all equations in } L\}$. Give a 4-query, length-$(2^n + 2^{n^2})$
PCPP system for $\mathcal{L}$ (with rejection rate a positive universal constant).
(Hint: The tester should expect the truth table of $\chi_w$ and $\chi_{ww^\top}$.
You will need part (c) as well as Exercise 7.15 applied to “$q$”.)

(e) Complete the proof of Theorem 7.19. (Hints: given $w \in \{0, 1\}^n$,
the tester should expect a proof consisting of all gate values $\bar{w} \in
\{0, 1\}^{\mathrm{size}(C)}$ in $C$’s computation on $w$, as well as truth tables of $\chi_{\bar{w}}$ and
$\chi_{\bar{w}\bar{w}^\top}$. Show that $\bar{w}$ being a valid computation of $C$ is encodable with
a list of homogeneous degree-2 polynomial equations. Add a consistency
check between $w$ and $\bar{w}$ using local correcting, and reduce the
number of queries to 3 using Exercise 7.12.)

7.19 Verify the connection between $\mathrm{Opt}(\mathcal{P})$ and $C$’s satisfiability stated in the
proof sketch of Theorem 7.35. (Hint: Every string $w$ is 1-far from the
empty property.)

7.20 A randomized assignment for an instance $\mathcal{P}$ of a CSP over domain $\Omega$ is a
mapping $F$ that labels each variable in $V$ with a probability distribution
over domain elements. Given a constraint $(S, \psi)$ with $S = (v_1, \dots, v_r)$,
we write $\psi(F(S)) \in [0, 1]$ for the expected value of $\psi(F(v_1), \dots, F(v_r))$.
This is simply the probability that $\psi$ is satisfied when one actually draws
from the domain-distributions assigned by $F$. Finally, we define the value
of $F$ to be $\mathrm{Val}_{\mathcal{P}}(F) = \mathbf{E}_{(S, \psi) \sim \mathcal{P}}[\psi(F(S))]$.

(a) Suppose that $A$ is a deterministic algorithm that produces a randomized
assignment of value $\alpha$ on a given instance $\mathcal{P}$. Show a simple
modification to $A$ that makes it a randomized algorithm that produces
a (normal) assignment whose value is $\alpha$ in expectation. (Thus, in
constructing approximation algorithms we may allow ourselves to
output randomized assignments.)

(b) Let $A$ be the deterministic Max-E3-Sat algorithm that on every
instance outputs the randomized assignment that assigns the uniform
distribution on $\{0, 1\}$ to each variable. Show that this is a $(\frac78, \beta)$-approximation
algorithm for any $\beta$. Show also that the same algorithm
is a $(\frac12, \beta)$-approximation algorithm for Max-3-Lin.

(c) When the domain $\Omega$ is $\{-1, 1\}$, we may model a randomized assignment
as a function $f : V \to [-1, 1]$; here $f(v) = \mu$ is interpreted as
the unique probability distribution on $\{-1, 1\}$ which has mean $\mu$. Now
given a constraint $(S, \psi)$ with $S = (v_1, \dots, v_r)$, show that the value
of $f$ on this constraint is in fact $\psi(f(v_1), \dots, f(v_r))$, where we identify
$\psi : \{-1, 1\}^r \to \{0, 1\}$ with its multilinear (Fourier) expansion.
(Hint: Exercise 1.4.)

(d) Let $\Psi$ be a collection of predicates over domain $\{-1, 1\}$. Let
$\nu = \min_{\psi \in \Psi}\{\widehat{\psi}(\emptyset)\}$. Show that outputting the randomized assignment
$f \equiv 0$ is an efficient $(\nu, \beta)$-approximation algorithm for Max-CSP($\Psi$).

7.21 Let $F$ be a randomized assignment of value $\alpha$ for CSP instance $\mathcal{P}$ (as
in Exercise 7.20). Give an efficient deterministic algorithm that outputs a
usual assignment $F'$ of value at least $\alpha$. (Hint: Try all possible labelings for
the first variable and compute the expected value that would be achieved
if $F$ were used for the remaining variables. Pick the best label for the first
variable and repeat.)
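
The greedy procedure in the hint can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures, not the book's formalism: an instance is a list of `(scope, predicate)` pairs, a randomized assignment is a dict mapping each variable to a dict of label probabilities, and the names `expected_value` and `derandomize` are hypothetical.

```python
from itertools import product

def expected_value(constraints, dist):
    """Exact value of a randomized assignment: the average, over constraints,
    of the probability that the predicate is satisfied."""
    total = 0.0
    for scope, pred in constraints:
        for labels in product(*(dist[v].keys() for v in scope)):
            p = 1.0
            for v, a in zip(scope, labels):
                p *= dist[v][a]
            if p > 0 and pred(*labels):
                total += p
    return total / len(constraints)

def derandomize(constraints, dist):
    """Fix variables one at a time, always choosing the label that maximizes
    the expected value with the remaining variables left randomized."""
    dist = {v: dict(d) for v, d in dist.items()}  # work on a copy
    for v in dist:
        best = max(
            dist[v],
            key=lambda a: expected_value(constraints, {**dist, v: {a: 1.0}}),
        )
        dist[v] = {best: 1.0}
    return {v: next(iter(d)) for v, d in dist.items()}
```

Because the current expected value is an average of the conditional values over the labels of the variable being fixed, the best label never decreases it, so the final assignment has value at least that of the randomized one.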

7.22 Given a local tester for functions $f : \{-1, 1\}^n \to \{-1, 1\}$, we can interpret
it also as a tester for functions $f : \{-1, 1\}^n \to [-1, 1]$; simply view
the tester as a CSP and view the acceptance probability as the value of $f$
when treated as a randomized assignment (as in Exercise 7.20(c)). Equivalently,
whenever the tester “queries” $f(x)$, imagine that what is returned
is a random bit $b \in \{-1, 1\}$ whose mean is $f(x)$. This interpretation
completes Definition 7.37 of Dictator-vs.-No-Notables tests for functions
$f : \{-1, 1\}^n \to [-1, 1]$ (see Remark 7.38). Given this definition, verify
that the Håstad$_\delta$ Test is indeed a $(\frac12, 1 - \delta)$-Dictator-vs.-No-Notables test.
(Hint: Show that (7.4) still holds for functions $f : \{-1, 1\}^n \to [-1, 1]$.
There is only one subsequent inequality that uses that $f$’s range is $\{-1, 1\}$,
and it still holds with range $[-1, 1]$.)

7.23 Let $\Psi$ be a finite set of predicates over domain $\Omega = \{-1, 1\}$ that is
closed under negating variables. (An example is the scenario of Max-$\psi$
from Remark 7.23.) In this exercise you will show that Dictator-vs.-No-Notables
tests using $\Psi$ may assume $f : \{-1, 1\}^n \to [-1, 1]$ is odd
without loss of generality.

(a) Let $T$ be an $(\alpha, \beta)$-Dictator-vs.-No-Notables test using predicate set $\Psi$
that works under the assumption that $f : \{-1, 1\}^n \to [-1, 1]$ is odd.
Modify $T$ as follows: Whenever it is about to query $f(x)$, with probability
$\frac12$ let it use $f(x)$ and with probability $\frac12$ let it use $-f(-x)$. Call
the modified test $T'$. Show that the probability $T'$ accepts an arbitrary
$f : \{-1, 1\}^n \to [-1, 1]$ is equal to the probability $T$ accepts $f^{\mathrm{odd}}$
(recall Exercise 1.8).

(b) Prove that $T'$ is an $(\alpha, \beta)$-Dictator-vs.-No-Notables test using predicate
set $\Psi$ for functions $f : \{-1, 1\}^n \to [-1, 1]$.

7.24 This problem is similar to Exercise 7.23 in that it shows you may assume
that Dictator-vs.-No-Notables tests are testing “smoothed” functions of
the form $T_{1-\delta}h$ for $h : \{-1, 1\}^n \to [-1, 1]$, so long as you are willing to
lose $O(\delta)$ in the probability that dictators are accepted.

(a) Let $U$ be an $(\alpha, \beta)$-Dictator-vs.-No-Notables test using an arity-$r$
predicate set $\Psi$ (over domain $\{-1, 1\}$) which works under the assumption
that the function $f : \{-1, 1\}^n \to [-1, 1]$ being tested is of the
form $T_{1-\delta}h$ for $h : \{-1, 1\}^n \to [-1, 1]$. Modify $U$ as follows: whenever
it is about to query $f(x)$, let it draw $y \sim N_{1-\delta}(x)$ and use $f(y)$
instead. Call the modified test $U'$. Show that the probability $U'$ accepts
an arbitrary $h : \{-1, 1\}^n \to [-1, 1]$ is equal to the probability $U$
accepts $T_{1-\delta}h$.

(b) Prove that $U'$ is an $(\alpha, \beta - r\delta/2)$-Dictator-vs.-No-Notables test using
predicate set $\Psi$.

7.25 Give a slightly alternate proof of Theorem 7.42 by using the original
BLR Test analysis and applying Exercises 7.23, 7.24.

7.26 Show that when using Theorem 7.40, it suffices to have a “Dictators-vs.-No-Influentials
test”, meaning replacing $\mathrm{Inf}_i^{(1-\epsilon)}[f]$ in Definition 7.37
with just $\mathrm{Inf}_i[f]$. (Hint: Exercise 7.24.)

7.27 For $q \in \mathbb{N}^+$, Unique-Games($q$) refers to the arity-2 CSP with domain
$\Omega = [q]$ in which all $q!$ “bijective” predicates are allowed; here $\psi$ is
“bijective” if there is a bijection $\pi : [q] \to [q]$ such that $\psi(i, j) = 1$
iff $\pi(j) = i$. Show that $(1, 1)$-approximating Unique-Games($q$) can be
done in polynomial time. (The Unique Games Conjecture of Khot (Khot,
2002) states that for all $\delta > 0$ there exists $q \in \mathbb{N}^+$ such that $(\delta, 1 - \delta)$-approximating
Unique-Games($q$) is NP-hard.)

7.28 In this problem you will show that Corollary 7.43 actually follows directly
from Corollary 7.44.

(a) Consider the $\mathbb{F}_2$-linear equation $v_1 + v_2 + v_3 = 0$. Exhibit a list of 4
clauses (i.e., logical ORs of literals) over the variables such that if the
equation is satisfied, then so are all 4 clauses, but if the equation is
not satisfied, then at most 3 of the clauses are. Do the same for the
equation $v_1 + v_2 + v_3 = 1$.
(b) Suppose that for every $\delta > 0$ there is an efficient algorithm for
$(\frac78 + \delta, 1 - \delta)$-approximating Max-E3-Sat. Give, for every $\delta > 0$, an
efficient algorithm for $(\frac12 + \delta, 1 - \delta)$-approximating Max-E3-Lin.
(c) Alternatively, show how to transform any $(\alpha, \beta)$-Dictator-vs.-No-Notables
test using Max-E3-Lin predicates into a $(\frac34 + \frac14\alpha, \beta)$-Dictator-vs.-No-Notables
test using Max-E3-Sat predicates.
7.29 In this exercise you will prove Theorem 7.41.
(a) Recall the predicate OXR from Exercise 1.1. Fix a small $0 < \delta < 1$.
The remainder of the exercise will be devoted to constructing a
$(\frac34 + \delta/4, 1)$-Dictator-vs.-No-Notables test using Max-OXR predicates.
Show how to convert this to a $(\frac78 + \delta/8, 1)$-Dictator-vs.-No-Notables
test using Max-E3-Sat predicates. (Hint: Similar to Exercise 7.28(c).)
(b) By Exercise 7.23, it suffices to construct a $(\frac34 + \delta/4, 1)$-Dictator-vs.-No-Notables
test using the OXR predicate assuming $f : \{-1, 1\}^n \to
[-1, 1]$ is odd. Håstad tests $\mathrm{OXR}(f(x), f(y), f(z))$ where $x, y, z \in
\{-1, 1\}^n$ are chosen randomly as follows: For each $i \in [n]$ (independently),
with probability $1 - \delta$ choose $(x_i, y_i, z_i)$ uniformly subject
to $x_i y_i z_i = -1$, and with probability $\delta$ choose $(x_i, y_i, z_i)$ uniformly
subject to $y_i z_i = -1$. Show that the probability this test accepts an
odd $f : \{-1, 1\}^n \to [-1, 1]$ is

where $J \subseteq_{1-\delta} S$ denotes that $J$ is a $(1 - \delta)$-random subset of $S$ in
the sense of Definition 4.15. In particular, show that dictators are
accepted with probability 1.

(c) Upper-bound (7.7) by 


i +8/4+ 3V0 -8Y +3) OYE ADN, 


5S 
or something stronger. (Hint: Cauchy–Schwarz.)
(d) Complete the proof that this is a $(\frac34 + \delta/4, 1)$-Dictator-vs.-No-Notables
test, assuming $f$ is odd.
7.30 In this exercise you will prove Theorem 7.40. Assume there exists an
$(\alpha, \beta)$-Dictator-vs.-No-Notables test $T$ using predicate set $\Psi$ over domain
$\{-1, 1\}$. We define a certain efficient algorithm $R$, which takes as input
an instance $\mathcal{G}$ of Unique-Games($q$) and outputs an instance $\mathcal{P}$ of Max-CSP($\Psi$).
For simplicity we refer to the variables $V$ of the Unique-Games
instance $\mathcal{G}$ as “vertices” and its constraints as “edges”. We also assume
that when $\mathcal{G}$ is viewed as an undirected graph, it is regular. (By a result
of Khot–Regev (Khot and Regev, 2008) this assumption is without loss
of generality for the purposes of the Unique Games Conjecture.) The
Max-CSP($\Psi$) instance $\mathcal{P}$ output by algorithm $R$ will have variable set
$V \times \{-1, 1\}^q$, and we write assignments for it as collections of functions
$(f_v)_{v \in V}$, where $f_v : \{-1, 1\}^q \to \{-1, 1\}$. The draw of a random
constraint for $\mathcal{P}$ is defined as follows:

• Choose $u \in V$ uniformly at random.

• Draw a random constraint from the test $T$; call it
$\psi(f(x^{(1)}), \dots, f(x^{(r)}))$.

• Choose $r$ random “neighbors” $v_1, \dots, v_r$ of $u$ in $\mathcal{G}$, independently
and uniformly. (By a neighbor of $u$, we mean a vertex $v$ such that
either $(u, v)$ or $(v, u)$ is the scope of a constraint in $\mathcal{G}$.) Since $\mathcal{G}$’s
constraints are bijective, we may assume that the associated scopes are
$(u, v_1), \dots, (u, v_r)$ with bijections $\pi_1, \dots, \pi_r : [q] \to [q]$.

• Output the constraint $\psi(f_{v_1}^{\pi_1}(x^{(1)}), \dots, f_{v_r}^{\pi_r}(x^{(r)}))$, where we use the
permutation notation $f^\pi$ from Exercise 1.30.

(a) Suppose $\mathrm{Opt}(\mathcal{G}) \geq 1 - \delta$. Show that there is an assignment for $\mathcal{P}$
with value at least $\beta - O(\delta)$ in which each $f_v$ is a dictator. (You will
use regularity of $\mathcal{G}$ here.) Thus $\mathrm{Opt}(\mathcal{P}) \geq \beta - O(\delta)$.

(b) Given an assignment $F = (f_v)_{v \in V}$ for $\mathcal{P}$, introduce for each $u \in V$
the function $g_u : \{-1, 1\}^q \to [-1, 1]$ defined by $g_u(x) = \mathbf{E}_v[f_v^\pi(x)]$,
where $v$ is a random neighbor of $u$ in $\mathcal{G}$ and $\pi$ is the associated constraint’s
permutation. Show that $\mathrm{Val}_{\mathcal{P}}(F) = \mathbf{E}_{u \in V}[\mathrm{Val}_T(g_u)]$ (using
the definition from Exercise 7.22).

(c) Fix an $\epsilon > 0$ and suppose that $\mathrm{Val}_{\mathcal{P}}(F) \geq \alpha + 2\lambda(\epsilon)$, where $\lambda$ is
the “rejection rate” associated with $T$. Show that for at least a
$\lambda(\epsilon)$-fraction of vertices $u \in V$, the set $\mathrm{NbrNotable}_u = \{i \in [q] :
\mathrm{Inf}_i^{(1-\epsilon)}[g_u] > \epsilon\}$ is nonempty.

(d) Show that for any $u \in V$ and $i \in [q]$ we have $\mathbf{E}_v[\mathrm{Inf}_{\pi^{-1}(i)}^{(1-\epsilon)}[f_v]] \geq
\mathrm{Inf}_i^{(1-\epsilon)}[g_u]$, where $v$ is a random neighbor of $u$ and $\pi$ is the associated
constraint’s permutation. (Hint: Exercise 2.48.)

(e) For $v \in V$, define also the set $\mathrm{Notable}_v = \{i \in [q] : \mathrm{Inf}_i^{(1-\epsilon)}[f_v] \geq
\epsilon/2\}$. Show that if $i \in \mathrm{NbrNotable}_u$, then $\Pr_v[\pi^{-1}(i) \in \mathrm{Notable}_v] \geq
\epsilon/2$, where $v$ and $\pi$ are as in the previous part.

(f) Show that for every $u \in V$ we have $|\mathrm{Notable}_u \cup \mathrm{NbrNotable}_u| \leq
O(1/\epsilon^2)$. (Hint: Proposition 2.54.)

(g) Consider the following randomized assignment for $\mathcal{G}$ (see
Exercise 7.20): for each $u \in V$, give it the uniform distribution on
$\mathrm{Notable}_u \cup \mathrm{NbrNotable}_u$ (if this set is nonempty; otherwise, give it
an arbitrary labeling). Show that this randomized assignment has
value $\Omega(\lambda(\epsilon)\epsilon^5)$.

(h) Conclude Theorem 7.40, where “UG-hard” means “NP-hard assuming
the Unique Games Conjecture”.


7.31 Technically, Exercise 7.30 has a small bug: Since a Dictator-vs.-No-Notables
test using predicate set $\Psi$ is allowed to use duplicate query
strings in its predicates (see Remark 7.38), the reduction in the previous
exercise does not necessarily output instances of Max-CSP($\Psi$) because
our definition of CSPs requires that each scope consist of distinct variables.
In this exercise you will correct this bug. Let $M \in \mathbb{N}^+$ and suppose
we modify the algorithm $R$ from Exercise 7.30 to a new algorithm $R'$,
producing an instance $\mathcal{P}'$ with variable set $V \times [M] \times \{-1, 1\}^q$. We now
think of assignments to $\mathcal{P}'$ as $M$-tuples of functions $f_v^1, \dots, f_v^M$, one
tuple for each $v \in V$. Further, thinking of $\mathcal{P}$ as a function tester, we have
$\mathcal{P}'$ act as follows: Whenever $\mathcal{P}$ is about to query $f_v(x)$, we have $\mathcal{P}'$ instead
query $f_v^j(x)$ for a uniformly random $j \in [M]$.
(a) Show that $\mathrm{Opt}(\mathcal{P}) = \mathrm{Opt}(\mathcal{P}')$.
(b) Show that if we delete all constraints in $\mathcal{P}'$ for which the scope
contains duplicates, then $\mathrm{Opt}(\mathcal{P}')$ changes by at most $1/M$.

(c) Show that the deleted version of $\mathcal{P}'$ is a genuine instance of Max-CSP($\Psi$).
Since the constant $1/M$ can be arbitrarily small, this corrects
the bug in Exercise 7.30’s proof of Theorem 7.40.




Notes 


The study of property testing was initiated by Rubinfeld and Sudan (Rubinfeld and
Sudan, 1996) and significantly expanded by Goldreich, Goldwasser, and Ron (Goldreich
et al., 1998); the stricter notion of local testability was introduced (in the context of
error-correcting codes) by Friedl and Sudan (Friedl and Sudan, 1995). The first local tester for
dictatorship was given by Bellare, Goldreich, and Sudan (Bellare et al., 1995, 1998) (as
in Exercise 7.8); it was later rediscovered by Parnas, Ron, and Samorodnitsky (Parnas
et al., 2001, 2002). The relevance of Arrow’s Theorem to testing dictatorship was pointed
out by Kalai (Kalai, 2002).

The idea of assisting testers by providing proofs grew out of complexity-theoretic
research on interactive proofs and PCPs; see the early work of Ergün, Kumar, and
Rubinfeld (Ergün et al., 1999) and the references therein. The specific definition of PCPPs was
introduced independently by Ben-Sasson, Goldreich, Harsha, Sudan, and Vadhan
(Ben-Sasson et al., 2004) and by Dinur and Reingold (Dinur and Reingold, 2004) in 2004.
Both of these works obtained the PCPP Theorem, relying on the fact that previous
literature essentially already gave PCPP reductions of exponential (or greater) proof
length: Ben-Sasson et al. (Ben-Sasson et al., 2004) observed that Theorem 7.19 can
be obtained from Arora et al. (Arora et al., 1998) (their proof is Exercise 7.18), while
Dinur and Reingold (Dinur and Reingold, 2004) pointed out that the slightly easier
Theorem 7.18 can be extracted from the work of Bellare, Goldreich, and Sudan (Bellare
et al., 1998). The proof we gave for Theorem 7.16 is inspired by the presentation in
Dinur (Dinur, 2007).

The PCP Theorem and its stronger forms (the PCPP Theorem and Theorem 7.20)
have a somewhat remarkable consequence. Suppose a researcher claims to prove a
famous mathematical conjecture, say, “P $\neq$ NP”. To ensure maximum confidence in
correctness, a journal might request the researcher submit a formalized proof, suitable
for a mechanical proof-checking system. If the submitted formalized proof $w$ is a
Boolean string of length $n$, the proof-checker will be implementable by a circuit $C$ of
size $O(n)$. Notice that the string property $\mathcal{C}$ decided by $C$ is nonempty if and only if
there exists a (length-$n$) proof of P $\neq$ NP. Suppose the journal applies Theorem 7.20
to $C$ and requires the researcher submit the additional proof $\Pi$ of length $n \cdot \mathrm{polylog}(n)$.
Now the journal can run a rather amazing testing algorithm, which reads just 3 bits of
the submitted proof $(w, \Pi)$. If the researcher’s proof of P $\neq$ NP is correct then the test
will accept with probability 1. On the other hand, if the test accepts with probability at
least $1 - \lambda$ (where $\lambda$ is the rejection rate in Theorem 7.20), then $w$ must be 1-close to
the set of strings accepted by $C$. This doesn’t necessarily mean that $w$ is a correct proof
of P $\neq$ NP, but it does mean that $\mathcal{C}$ is nonempty, and hence a correct proof of P $\neq$ NP
exists! By querying a larger constant number of bits from $(w, \Pi)$ as in Exercise 7.1, say,
$\lceil 30/\lambda \rceil$ bits, the journal can become 99.99% convinced that indeed P $\neq$ NP.

CSPs are very widely studied in computer science; it is impossible to survey the topic
here. In the case of Boolean CSPs various monographs (Creignou et al., 2001; Khanna
et al., 2001) contain useful background regarding complexity theory and approximation
algorithms. The notion of approximation algorithms and the derandomized $(\frac78, 1)$-approximation
algorithm for Max-E3-Sat (Proposition 7.36, Exercise 7.21) are due to
Johnson (Johnson, 1974). Incidentally, there is also an efficient $(\frac78, 1)$-approximation
algorithm for Max-3-Sat (Karloff and Zwick, 1997), but both the algorithm and its
analysis are extremely difficult, the latter requiring computer assistance (Zwick, 2002).

Håstad’s hardness theorems appeared in 2001 (Håstad, 2001b), building on earlier
work (Håstad, 1996, 1999). Håstad (Håstad, 2001b) also proved NP-hardness
of $(\frac1p + \delta, 1 - \delta)$-approximating Max-E3-Lin(mod $p$) (for $p$ prime) and of
$(\frac78, 1)$-approximating Max-CSP($\{\mathrm{NAE}_4\}$), both of which are optimal. Using tools due to
Trevisan et al. (Trevisan et al., 2000), Hastad also showed NP-hardness of 7 +6, 3 - 
approximating Max-Cut, which is still the best known such result. The best known 
inapproximability result for Unique-Games($q$) is NP-hardness of $(\frac38 + q^{-\Theta(1)}, \frac12)$-approximation
(O’Donnell and Wright, 2012). Khot’s influential Unique Games Conjecture
dates from 2002 (Khot, 2002); the peculiar name has its origins in a work of
Feige and Lovász (Feige and Lovász, 1992). The generic Theorem 7.40, giving
UG-hardness from Dictator-vs.-No-Notables tests, essentially appears in Khot et al. (Khot
et al., 2007). (We remark that the terminology “Dictator-vs.-No-Notables test” is not
standard.) If one is willing to assume the Unique Games Conjecture, there is an
almost-complete theory of optimal inapproximability due to Raghavendra (Raghavendra, 2009).
Many more inapproximability results, with and without the Unique Games Conjecture,
are known; for some surveys, see those of Khot (Khot, 2005, 2010a,b).

As mentioned, Exercise 7.8 is due to Bellare, Goldreich, and Sudan (Bellare et al.,
1995) and to Parnas, Ron, and Samorodnitsky (Parnas et al., 2001). The technique
described in Exercise 7.21 is known as the Method of Conditional Expectations. The
trick in Exercise 7.23 is closely related to the notion of “folding” from the theory of
PCPs. The bug described in Exercise 7.31 is rarely addressed in the literature; the trick
used to overcome it appears in, e.g., Arora et al. (Arora et al., 2005).


8 


Generalized Domains 


So far we have studied functions $f : \{0, 1\}^n \to \mathbb{R}$. What about, say,
$f : \{0, 1, 2\}^n \to \mathbb{R}$? In fact, very little of what we’ve done so far depends
on the domain being $\{0, 1\}^n$; what it has mostly depended on is our viewing
the domain as a product probability distribution. Indeed, much of analysis of
Boolean functions carries over to the case of functions $f : \Omega_1 \times \cdots \times \Omega_n \to \mathbb{R}$
where the domain has a product probability distribution $\pi_1 \otimes \cdots \otimes \pi_n$. There
are two main exceptions: the “derivative” operator $D_i$ does not generalize to the
case when $|\Omega_i| > 2$ (though the Laplacian operator $L_i$ does), and the important
notion of hypercontractivity (introduced in Chapter 9) depends strongly on the
probability distributions $\pi_i$.

In this chapter we focus on the case where all the $\Omega_i$’s are the same, as are
the $\pi_i$’s. (This is just to save on notation; it will be clear that everything we do
holds in the more general setting.) Important classic cases include functions
on the $p$-biased hypercube (Section 8.4) and functions on abelian groups
(Section 8.5). For the issue of generalizing the range of functions (e.g., studying
functions $f : \{0, 1, 2\}^n \to \{0, 1, 2\}$), see Exercise 8.33.


8.1. Fourier Bases for Product Spaces 
We will now begin to discuss functions on (finite) product probability spaces. 


Definition 8.1. Let $(\Omega, \pi)$ be a finite probability space with $|\Omega| \geq 2$ and
assume $\pi$ has full support. For $n \in \mathbb{N}^+$ we write $L^2(\Omega^n, \pi^{\otimes n})$ for the (real)
inner product space of functions $f : \Omega^n \to \mathbb{R}$, with inner product

\[ \langle f, g \rangle = \mathop{\mathbf{E}}_{x \sim \pi^{\otimes n}}[f(x) g(x)]. \]

Here $\pi^{\otimes n}$ denotes the product probability distribution on $\Omega^n$.

Example 8.2. A simple example to keep in mind is $\Omega = \{a, b, c\}$ with $\pi(a) =
\pi(b) = \pi(c) = 1/3$. Here $a$, $b$, and $c$ are simply abstract set elements.


We can (and will) generalize to nondiscrete probability spaces, and to com- 
plex inner product spaces. However, we will keep to the above definition for 
now. 


Notation 8.3. We will write $\pi_{1/2}$ for the uniform probability distribution
on $\{-1, 1\}$. Thus so far in this book we have been studying functions in
$L^2(\{-1, 1\}^n, \pi_{1/2}^{\otimes n})$. For simplicity, we will write this as $L^2(\{-1, 1\}^n)$.


Notation 8.4. Much of the notation we used for $L^2(\{-1, 1\}^n)$ extends naturally
to the case of $L^2(\Omega^n, \pi^{\otimes n})$: e.g., $\|f\|_p = \mathbf{E}_{x \sim \pi^{\otimes n}}[|f(x)|^p]^{1/p}$, or the restriction
notation from Chapter 3.3.


As we described in Chapter 1.4, the essence of Boolean Fourier analysis is
in deriving combinatorial properties of a Boolean function $f : \{-1, 1\}^n \to \mathbb{R}$
from its coefficients over a particular basis of $L^2(\{-1, 1\}^n)$, the basis of parity
functions. We would like to achieve the same thing more generally for functions
in $L^2(\Omega^n, \pi^{\otimes n})$. We begin by considering vector space bases more generally.


Definition 8.5. Let $|\Omega| = m$. The indicator basis (or standard basis) for
$L^2(\Omega, \pi)$ is just the set of $m$ indicator functions $(1_x)_{x \in \Omega}$, where

\[ 1_x(y) = \begin{cases} 1 & \text{if } y = x, \\ 0 & \text{if } y \neq x. \end{cases} \]

Fact 8.6. The indicator basis is indeed a basis for $L^2(\Omega, \pi)$ since the functions
$(1_x)_{x \in \Omega}$ are nonzero, spanning, and orthogonal. Hence $\dim(L^2(\Omega, \pi)) = m$.


We will usually fix $\Omega$ and $\pi$ and then consider $L^2(\Omega^n, \pi^{\otimes n})$ for $n \in \mathbb{N}^+$.
Applying the above definition gives us an indicator basis $(1_x)_{x \in \Omega^n}$ for the
$m^n$-dimensional space $L^2(\Omega^n, \pi^{\otimes n})$. The representation of $f \in L^2(\Omega, \pi)$ in
this basis is just $f = \sum_{x \in \Omega} f(x) 1_x$. This is not very interesting; the coefficients
are just the values of $f$ so they don’t tell us anything new about the function.
We would like a different basis that will generate useful “Fourier formulas” as
in Chapter 1.4.

For inspiration, let’s look critically at the familiar case of $L^2(\{-1, 1\}^n)$. Here
we used the basis of all parity functions, $\chi_S(x) = \prod_{i \in S} x_i$. It will be helpful to
think of the basis function $\chi_S : \{-1, 1\}^n \to \mathbb{R}$ as follows: Identify $S$ with its
0-1 indicator vector and write

\[ \chi_S(x) = \prod_{i=1}^n \phi_{S_i}(x_i), \quad \text{where } \phi_0 \equiv 1, \ \phi_1 = \mathrm{id}. \]

(Here $\mathrm{id}$ is just the identity map $\mathrm{id}(b) = b$.) We will identify three properties
of this basis which we’d like to generalize.

First, the parity basis is a product basis. We can break down its “product
structure” as follows: For each coordinate $i \in [n]$ of the product domain
$\{-1, 1\}^n$, the set $\{1, \mathrm{id}\}$ is a basis for the 2-dimensional space $L^2(\{-1, 1\}, \pi_{1/2})$.
We then get a basis for the $2^n$-dimensional product space $L^2(\{-1, 1\}^n)$ by taking
all possible $n$-fold products. More generally, suppose we are given an inner
product space $L^2(\Omega, \pi)$ with $|\Omega| = m$. Let $\phi_0, \dots, \phi_{m-1}$ be any basis for this
space. Then the set of all products $\phi_{i_1} \phi_{i_2} \cdots \phi_{i_n}$ ($0 \leq i_j < m$) forms a basis for
the space $L^2(\Omega^n, \pi^{\otimes n})$.

Second, it is convenient that the parity basis is orthonormal. We will
later check that if a basis $\phi_0, \dots, \phi_{m-1}$ for $L^2(\Omega, \pi)$ is orthonormal, then
so too is the associated product basis for $L^2(\Omega^n, \pi^{\otimes n})$. This relies on the
fact that $\pi^{\otimes n}$ is the product distribution. For example, the parity basis for
$L^2(\{-1, 1\}^n)$ is orthonormal because the basis $\{1, \mathrm{id}\}$ for $L^2(\{-1, 1\}, \pi_{1/2})$
is orthonormal: $\mathbf{E}[1^2] = \mathbf{E}[x_i^2] = 1$, $\mathbf{E}[1 \cdot x_i] = 0$. Orthonormality is the
property that makes Parseval’s Theorem hold; in the general context, this
means that if $f \in L^2(\Omega, \pi)$ has the representation $\sum_{i=0}^{m-1} c_i \phi_i$ then
$\mathbf{E}[f^2] = \sum_{i=0}^{m-1} c_i^2$.

Finally, the parity basis contains the constant function 1. This fact leads
to several of our pleasant Fourier formulas. In particular, when you take an
orthonormal basis $\phi_0, \dots, \phi_{m-1}$ for $L^2(\Omega, \pi)$ which has $\phi_0 \equiv 1$, then $0 =
\langle \phi_0, \phi_i \rangle = \mathbf{E}_{x \sim \pi}[\phi_i(x)]$ for all $i > 0$. Hence if $f \in L^2(\Omega, \pi)$ has the expansion
$f = \sum_{i=0}^{m-1} c_i \phi_i$, then $\mathbf{E}[f] = c_0$ and $\mathbf{Var}[f] = \sum_{i > 0} c_i^2$.

We encapsulate the second and third properties with a definition: 


Definition 8.7. A Fourier basis for an inner product space $L^2(\Omega, \pi)$ is an
orthonormal basis $\phi_0, \dots, \phi_{m-1}$ with $\phi_0 \equiv 1$.


Example 8.8. For each $n \in \mathbb{N}^+$, the $2^n$ parity functions $(\chi_S)_{S \subseteq [n]}$ form a Fourier
basis for $L^2(\{-1, 1\}^n, \pi_{1/2}^{\otimes n})$.


Remark 8.9. A Fourier basis for $L^2(\Omega, \pi)$ always exists because you can
extend the set $\{1\}$ to a basis and then perform the Gram–Schmidt process. On the
other hand, Fourier bases are not unique. Even in the case of $L^2(\{-1, 1\}, \pi_{1/2})$
there are two possibilities: the basis $\{1, \mathrm{id}\}$ and the basis $\{1, -\mathrm{id}\}$.




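Example 8.10. In the case of $\Omega = \{a, b, c\}$ with $\pi(a) = \pi(b) = \pi(c) = 1/3$,
one possible Fourier basis (see Exercise 8.4) is

\[ \phi_0 \equiv 1, \qquad
\begin{aligned} \phi_1(a) &= +\sqrt{2}, \\ \phi_1(b) &= -\sqrt{2}/2, \\ \phi_1(c) &= -\sqrt{2}/2, \end{aligned} \qquad
\begin{aligned} \phi_2(a) &= 0, \\ \phi_2(b) &= +\sqrt{6}/2, \\ \phi_2(c) &= -\sqrt{6}/2. \end{aligned} \]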
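
The claim in Example 8.10 is easy to confirm numerically. The following is a small Python sketch (the names `OMEGA`, `PI`, `PHI`, and `inner` are chosen for this illustration only) that computes the Gram matrix of the three functions under the uniform distribution and lets one check that it is the identity.

```python
import math

OMEGA = ("a", "b", "c")
PI = {x: 1 / 3 for x in OMEGA}          # uniform distribution on Omega
S2, S6 = math.sqrt(2), math.sqrt(6)

# The three basis functions of Example 8.10, stored as value tables.
PHI = (
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"a": S2, "b": -S2 / 2, "c": -S2 / 2},
    {"a": 0.0, "b": S6 / 2, "c": -S6 / 2},
)

def inner(f, g):
    """<f, g> = E_{x ~ pi}[f(x) g(x)]."""
    return sum(PI[x] * f[x] * g[x] for x in OMEGA)

# Gram matrix; orthonormality means this is the 3x3 identity (up to rounding).
gram = [[inner(p, q) for q in PHI] for p in PHI]
```

Since $\phi_0 \equiv 1$, the first row of the Gram matrix also records that $\phi_1$ and $\phi_2$ have mean zero.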
As mentioned, given a Fourier basis for $L^2(\Omega, \pi)$ you can construct a Fourier
basis for any $L^2(\Omega^n, \pi^{\otimes n})$ by “taking all $n$-fold products”. To make this precise
we need some notation.


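Definition 8.11. An $n$-dimensional multi-index is a tuple $\alpha \in \mathbb{N}^n$. We write

\[ \mathrm{supp}(\alpha) = \{i : \alpha_i \neq 0\}, \qquad \#\alpha = |\mathrm{supp}(\alpha)|, \qquad |\alpha| = \sum_{i=1}^n \alpha_i. \]

We may write $\alpha \in \mathbb{N}^n_{<m}$ when we want to emphasize that each $\alpha_i \in
\{0, 1, \dots, m - 1\}$.


Definition 8.12. Given functions $\phi_0, \dots, \phi_{m-1} \in L^2(\Omega, \pi)$ and a multi-index
$\alpha \in \mathbb{N}^n_{<m}$, we define $\phi_\alpha \in L^2(\Omega^n, \pi^{\otimes n})$ by

\[ \phi_\alpha(x) = \prod_{i=1}^n \phi_{\alpha_i}(x_i). \]

Now we can show that products of Fourier bases are Fourier bases.

Proposition 8.13. Let $\phi_0, \dots, \phi_{m-1}$ be a Fourier basis for $L^2(\Omega, \pi)$. Then the
collection $(\phi_\alpha)_{\alpha \in \mathbb{N}^n_{<m}}$ is a Fourier basis for $L^2(\Omega^n, \pi^{\otimes n})$ (with the understanding
that $\alpha = (0, 0, \dots, 0)$ indexes the constant function 1).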

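Proof. First we check orthonormality. For any multi-indices $\alpha, \beta \in \mathbb{N}^n_{<m}$, we
have

\begin{align*}
\langle \phi_\alpha, \phi_\beta \rangle
  &= \mathop{\mathbf{E}}_{x \sim \pi^{\otimes n}}[\phi_\alpha(x) \cdot \phi_\beta(x)] \\
  &= \mathop{\mathbf{E}}_{x \sim \pi^{\otimes n}}\Bigl[\prod_{i=1}^n \phi_{\alpha_i}(x_i) \cdot \prod_{i=1}^n \phi_{\beta_i}(x_i)\Bigr] \\
  &= \prod_{i=1}^n \mathop{\mathbf{E}}_{x_i \sim \pi}[\phi_{\alpha_i}(x_i) \cdot \phi_{\beta_i}(x_i)] && \text{(since $\pi^{\otimes n}$ is a product distribution)} \\
  &= \prod_{i=1}^n 1_{\{\alpha_i = \beta_i\}} && \text{(since $\{\phi_0, \dots, \phi_{m-1}\}$ is orthonormal)} \\
  &= 1_{\{\alpha = \beta\}}.
\end{align*}

This confirms that the collection $(\phi_\alpha)_{\alpha \in \mathbb{N}^n_{<m}}$ is orthonormal, and consequently
linearly independent. It is therefore also a basis because it has cardinality $m^n$,
which we know is the dimension of $L^2(\Omega^n, \pi^{\otimes n})$ (see Fact 8.6).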
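
As a sanity check of Proposition 8.13 in a small case, the Python sketch below builds the nine product functions for $n = 2$ from the basis of Example 8.10 and computes all their pairwise inner products. The helper names (`phi`, `weight`, `inner`, `gram`) are assumptions of this illustration.

```python
import math
from itertools import product

OMEGA = ("a", "b", "c")
PI = {x: 1 / 3 for x in OMEGA}
S2, S6 = math.sqrt(2), math.sqrt(6)
PHI = (
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"a": S2, "b": -S2 / 2, "c": -S2 / 2},
    {"a": 0.0, "b": S6 / 2, "c": -S6 / 2},
)

def phi(alpha, x):
    """Product basis function phi_alpha(x) = prod_i phi_{alpha_i}(x_i)."""
    return math.prod(PHI[a][xi] for a, xi in zip(alpha, x))

def weight(x):
    """Probability of x under the product distribution."""
    return math.prod(PI[xi] for xi in x)

def inner(f, g, n):
    """<f, g> with respect to the product distribution on OMEGA^n."""
    return sum(weight(x) * f(x) * g(x) for x in product(OMEGA, repeat=n))

n = 2
indices = list(product(range(len(PHI)), repeat=n))   # all multi-indices
gram = {
    (a, b): inner(lambda x: phi(a, x), lambda x: phi(b, x), n)
    for a in indices
    for b in indices
}
```

Each entry `gram[(a, b)]` should be 1 when the two multi-indices agree and 0 otherwise, and there are $3^2 = 9$ functions, matching the dimension count in the proof.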


Given a product Fourier basis as in Proposition 8.13, we can express any
$f \in L^2(\Omega^n, \pi^{\otimes n})$ as a linear combination of basis functions. We will write
$\widehat{f}(\alpha)$ for the “Fourier coefficient” on $\phi_\alpha$ in this expression.


Definition 8.14. Having fixed a Fourier basis $\phi_0, \dots, \phi_{m-1}$ for $L^2(\Omega, \pi)$, every
$f \in L^2(\Omega^n, \pi^{\otimes n})$ is uniquely expressible as

\[ f = \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha) \phi_\alpha. \]

This is the Fourier expansion of $f$ with respect to the basis. The real number
$\widehat{f}(\alpha)$ is called the Fourier coefficient of $f$ on $\alpha$ and it satisfies

\[ \widehat{f}(\alpha) = \langle f, \phi_\alpha \rangle. \]


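Example 8.15. Fix the Fourier basis as in Example 8.10. Let $f : \{a, b, c\}^2 \to
\{0, 1\}$ be the function which is 1 if and only if both inputs are $c$. Then you can
check (Exercise 8.5) that

\begin{align*}
f = \tfrac19 &- \tfrac{\sqrt{2}}{18}\phi_{(1,0)} - \tfrac{\sqrt{6}}{18}\phi_{(2,0)} - \tfrac{\sqrt{2}}{18}\phi_{(0,1)} - \tfrac{\sqrt{6}}{18}\phi_{(0,2)} + \tfrac{1}{18}\phi_{(1,1)} \\
 &+ \tfrac{\sqrt{12}}{36}\phi_{(1,2)} + \tfrac{\sqrt{12}}{36}\phi_{(2,1)} + \tfrac16\phi_{(2,2)}.
\end{align*}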

The notation $\widehat{f}(\alpha)$ may seem poorly chosen because it doesn’t show the
dependence on the basis. However, the Fourier formulas we develop in the next
section will have the property that they are the same for every product Fourier
basis. We will show a basis-independent way of developing the formulas in
Section 8.3.


8.2. Generalized Fourier Formulas 


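In this section we will revisit a number of combinatorial/probabilistic notions
and show that for functions $f \in L^2(\Omega^n, \pi^{\otimes n})$, these notions have familiar
Fourier formulas that don’t depend on the Fourier basis.

The orthonormality of Fourier bases gives us some formulas almost immediately:

Proposition 8.16. Let $f, g \in L^2(\Omega^n, \pi^{\otimes n})$. Then for any fixed product Fourier
basis, the following formulas hold:

\begin{align*}
\mathbf{E}[f] &= \widehat{f}(0) \\
\mathbf{E}[f^2] &= \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha)^2 && \text{(Parseval)} \\
\mathbf{Var}[f] &= \sum_{\alpha \neq 0} \widehat{f}(\alpha)^2 \\
\langle f, g \rangle &= \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha) \widehat{g}(\alpha) && \text{(Plancherel)} \\
\mathbf{Cov}[f, g] &= \sum_{\alpha \neq 0} \widehat{f}(\alpha) \widehat{g}(\alpha).
\end{align*}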


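Proof. We verify Plancherel’s Theorem, from which the other identities follow
(Exercise 8.6):

\begin{align*}
\langle f, g \rangle
  &= \Bigl\langle \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha) \phi_\alpha, \ \sum_{\beta \in \mathbb{N}^n_{<m}} \widehat{g}(\beta) \phi_\beta \Bigr\rangle \\
  &= \sum_{\alpha, \beta \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha) \widehat{g}(\beta) \langle \phi_\alpha, \phi_\beta \rangle \\
  &= \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha) \widehat{g}(\alpha)
\end{align*}

by orthonormality of $(\phi_\alpha)_{\alpha \in \mathbb{N}^n_{<m}}$.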
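
The five identities of Proposition 8.16 can be spot-checked numerically. The Python sketch below uses the basis of Example 8.10 with $n = 2$ and two arbitrary real-valued functions drawn from a seeded random generator; the function tables, the seed, and all helper names are assumptions made for this illustration.

```python
import math
import random
from itertools import product

OMEGA = ("a", "b", "c")
S2, S6 = math.sqrt(2), math.sqrt(6)
PHI = (
    {"a": 1.0, "b": 1.0, "c": 1.0},
    {"a": S2, "b": -S2 / 2, "c": -S2 / 2},
    {"a": 0.0, "b": S6 / 2, "c": -S6 / 2},
)
N = 2
POINTS = list(product(OMEGA, repeat=N))
INDICES = list(product(range(len(PHI)), repeat=N))
ZERO = (0,) * N

def phi(alpha, x):
    return math.prod(PHI[a][xi] for a, xi in zip(alpha, x))

def mean(values):
    """Expectation under the uniform product distribution on OMEGA^N."""
    return sum(values) / len(POINTS)

def coeffs(h):
    """All Fourier coefficients <h, phi_alpha> of a function table h."""
    return {al: mean(h[x] * phi(al, x) for x in POINTS) for al in INDICES}

rng = random.Random(0)
f = {x: rng.uniform(-1, 1) for x in POINTS}
g = {x: rng.uniform(-1, 1) for x in POINTS}
fh, gh = coeffs(f), coeffs(g)

mean_f = mean(f[x] for x in POINTS)
mean_g = mean(g[x] for x in POINTS)
mean_f2 = mean(f[x] ** 2 for x in POINTS)
inner_fg = mean(f[x] * g[x] for x in POINTS)

# Each entry pairs the probabilistic quantity with its Fourier formula.
checks = {
    "mean": (mean_f, fh[ZERO]),
    "Parseval": (mean_f2, sum(c * c for c in fh.values())),
    "variance": (mean_f2 - mean_f ** 2,
                 sum(c * c for al, c in fh.items() if al != ZERO)),
    "Plancherel": (inner_fg, sum(fh[al] * gh[al] for al in INDICES)),
    "covariance": (inner_fg - mean_f * mean_g,
                   sum(fh[al] * gh[al] for al in INDICES if al != ZERO)),
}
```

In every pair the two numbers should agree up to floating-point rounding, whichever Fourier basis is plugged into `PHI`.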


We now give a definition that will be the key for developing basis- 
independent Fourier expansions. In the case of L?({—1, 1}) this definition 
appeared already in Exercise 3.28. 


Definition 8.17. Let $J \subseteq [n]$ and write $\overline{J} = [n] \setminus J$.
Given $f \in L^2(\Omega^n, \pi^{\otimes n})$, the projection of $f$ on
coordinates $J$ is the function $f^{\subseteq J} \in L^2(\Omega^n, \pi^{\otimes n})$
defined by

$$f^{\subseteq J}(x) = \mathop{\mathbf{E}}_{x' \sim \pi^{\otimes \overline{J}}}[f(x_J, x')],$$

where $x_J \in \Omega^J$ denotes the values of $x$ in the $J$-coordinates. In
other words, $f^{\subseteq J}(x)$ is the expectation of $f$ when the
$\overline{J}$-coordinates of $x$ are rerandomized. Note that we take
$f^{\subseteq J}$ to have $\Omega^n$ as its domain, even though it only depends
on the coordinates in $J$.

Forming $f^{\subseteq J}$ is indeed the application of a projection linear
operator to $f$, namely the expectation over $\overline{J}$ operator,
$\mathrm{E}_{\overline{J}}$. We take this as the definition of the operator:
$\mathrm{E}_{\overline{J}} f = f^{\subseteq J}$. When $\overline{J} = \{i\}$ is a
singleton we write simply $\mathrm{E}_i$.


Remark 8.18. This definition of $\mathrm{E}_i$ is consistent with Definition 2.23.
You are asked to verify that $\mathrm{E}_{\overline{J}}$ is indeed a projection,
self-adjoint linear operator in Exercise 8.7.


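Proposition 8.19. Let $J \subseteq [n]$ and $f \in L^2(\Omega^n, \pi^{\otimes n})$.
Then for any fixed product Fourier basis,

$$f^{\subseteq J} = \sum_{\substack{\alpha \in \mathbb{N}^n_{<m} \\ \mathrm{supp}(\alpha) \subseteq J}} \widehat{f}(\alpha)\, \phi_\alpha.$$

Proof. Since $\mathrm{E}_{\overline{J}}$ is a linear operator, it suffices to
verify for all $\alpha$ that

$$\phi_\alpha^{\subseteq J} = \begin{cases} \phi_\alpha & \text{if } \mathrm{supp}(\alpha) \subseteq J, \\ 0 & \text{otherwise.} \end{cases}$$

If $\mathrm{supp}(\alpha) \subseteq J$, then $\phi_\alpha$ does not depend on the
coordinates outside $J$; hence indeed $\phi_\alpha^{\subseteq J} = \phi_\alpha$.
So suppose $\mathrm{supp}(\alpha) \not\subseteq J$. Since
$\phi_\alpha(x) = \big(\prod_{i \in J} \phi_{\alpha_i}(x_i)\big) \cdot \big(\prod_{i \in \overline{J}} \phi_{\alpha_i}(x_i)\big)$,
we can write $\phi_\alpha = \phi_{\alpha_J} \cdot \phi_{\alpha_{\overline{J}}}$,
where $\phi_{\alpha_J}$ depends only on the coordinates in $J$,
$\phi_{\alpha_{\overline{J}}}$ depends only on the coordinates in $\overline{J}$,
and $\mathbf{E}[\phi_{\alpha_{\overline{J}}}] = 0$ precisely because
$\mathrm{supp}(\alpha) \not\subseteq J$. Thus for every $x \in \Omega^n$,

$$\phi_\alpha^{\subseteq J}(x) = \mathop{\mathbf{E}}_{x' \sim \pi^{\otimes \overline{J}}}[\phi_{\alpha_J}(x_J)\, \phi_{\alpha_{\overline{J}}}(x')] = \phi_{\alpha_J}(x_J) \cdot \mathbf{E}[\phi_{\alpha_{\overline{J}}}(x')] = 0$$

as needed. □

Corollary 8.20. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ and fix a product
Fourier basis. If $f$ depends only on the coordinates in $J \subseteq [n]$ then
$\widehat{f}(\alpha) = 0$ whenever $\mathrm{supp}(\alpha) \not\subseteq J$.

Proof. This follows from Proposition 8.19 because $f = f^{\subseteq J}$.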


Corollary 8.21. Let $i \in [n]$ and $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then
for any fixed product Fourier basis,

$$\mathrm{E}_i f = \sum_{\alpha : \alpha_i = 0} \widehat{f}(\alpha)\, \phi_\alpha.$$


Let us now define influences for functions $f \in L^2(\Omega^n, \pi^{\otimes n})$.
In the case of $\Omega = \{-1, 1\}$, our definition of $\mathbf{Inf}_i[f]$ from
Chapter 2.2 was $\mathbf{E}[(\mathrm{D}_i f)^2]$. However, the notion of a
derivative operator does not make sense for more general domains $\Omega$. In
fact, even in the case of $\Omega = \{-1, 1\}$ it isn’t a basis-invariant
notion: the choice of $\frac{f(x^{(i \mapsto 1)}) - f(x^{(i \mapsto -1)})}{2}$
rather than $\frac{f(x^{(i \mapsto -1)}) - f(x^{(i \mapsto 1)})}{2}$ is inherently
arbitrary. Instead we can fall back on the Laplacian operators, and take the
identity $\mathbf{Inf}_i[f] = \langle f, \mathrm{L}_i f \rangle$ from
Proposition 2.26 as a definition.


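Definition 8.22. Let $i \in [n]$ and $f \in L^2(\Omega^n, \pi^{\otimes n})$. The
$i$th coordinate Laplacian operator $\mathrm{L}_i$ is the self-adjoint,
projection linear operator defined by

$$\mathrm{L}_i f = f - \mathrm{E}_i f.$$

The influence of coordinate $i$ on $f$ is defined to be

$$\mathbf{Inf}_i[f] = \langle f, \mathrm{L}_i f \rangle = \langle \mathrm{L}_i f, \mathrm{L}_i f \rangle.$$

The total influence of $f$ is defined to be $\mathbf{I}[f] = \sum_{i=1}^n \mathbf{Inf}_i[f]$.

You can think of $\mathrm{L}_i f$ as “the part of $f$ which depends on the $i$th coordinate”.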


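Proposition 8.23. Let $i \in [n]$ and $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then
for any fixed product Fourier basis,

$$\mathrm{L}_i f = \sum_{\alpha : \alpha_i \neq 0} \widehat{f}(\alpha)\, \phi_\alpha, \qquad \mathbf{Inf}_i[f] = \sum_{\alpha : \alpha_i \neq 0} \widehat{f}(\alpha)^2, \qquad \mathbf{I}[f] = \sum_{\alpha} \#\alpha \cdot \widehat{f}(\alpha)^2.$$

Proof. The first formula is immediate from Corollary 8.21, the second from
Plancherel, and the third from summing over $i$.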


Exercise 8.9 asks you to verify the following formulas (cf. Exercise 2.21), 
which are often useful for computations: 


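Proposition 8.24. Let $i \in [n]$ and $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then

$$\mathbf{Inf}_i[f] = \mathop{\mathbf{E}}_{x \sim \pi^{\otimes n}}\Big[\mathop{\mathbf{Var}}_{x_i' \sim \pi}[f(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n)]\Big].$$

If furthermore $f$’s range is $\{-1, 1\}$, then

$$\mathbf{Inf}_i[f] = \mathbf{E}[|\mathrm{L}_i f|] = 2 \mathop{\mathbf{Pr}}_{x \sim \pi^{\otimes n},\ x_i' \sim \pi}[f(x) \neq f(x_1, \dots, x_{i-1}, x_i', x_{i+1}, \dots, x_n)].$$

Example 8.25. Let’s continue Example 8.15, in which $\{a, b, c\}$ has the
uniform distribution and $f : \{a, b, c\}^2 \to \{0, 1\}$ is 1 if and only if
both inputs are $c$. We compute $\mathbf{Inf}_1[f]$ two ways. Using
Proposition 8.24 we have $\mathbf{Var}[f(x_1, a)] = \mathbf{Var}[f(x_1, b)] = 0$
and $\mathbf{Var}[f(x_1, c)] = \tfrac{1}{3} \cdot \tfrac{2}{3} = \tfrac{2}{9}$
(because $f(x_1, c)$ is Bernoulli with parameter $\tfrac{1}{3}$); thus
$\mathbf{Inf}_1[f] = \tfrac{1}{3} \cdot \tfrac{2}{9} = \tfrac{2}{27}$.
Alternatively, using the formula from Proposition 8.23 as well as the Fourier
expansion from Example 8.15, we can compute

$$\mathbf{Inf}_1[f] = \big(\tfrac{\sqrt{2}}{18}\big)^2 + \big(\tfrac{\sqrt{6}}{18}\big)^2 + \big(\tfrac{1}{18}\big)^2 + \big(\tfrac{\sqrt{12}}{36}\big)^2 + \big(\tfrac{\sqrt{12}}{36}\big)^2 + \big(\tfrac{1}{6}\big)^2 = \tfrac{2}{27}.$$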
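The two basis-free descriptions of influence (the Laplacian definition and the expected-variance formula of Proposition 8.24) can also be compared by brute force. The following is a minimal sketch using exact rational arithmetic; the helper names (`E_i`, `influence_laplacian`, `influence_variance`) are ours, coordinates are 0-indexed, and the uniform distribution on $\{a, b, c\}$ is assumed as in Example 8.25.

```python
from fractions import Fraction
from itertools import product

OMEGA = "abc"                       # uniform distribution assumed
N = 2
POINTS = list(product(OMEGA, repeat=N))


def f(x):
    """1 iff both inputs are c."""
    return Fraction(1) if x == ("c", "c") else Fraction(0)


def E(h):
    """Expectation of h over the uniform product distribution."""
    return sum(h(x) for x in POINTS) / len(POINTS)


def E_i(h, i):
    """The operator E_i: average out coordinate i, leaving the others fixed."""
    def averaged(x):
        values = [h(x[:i] + (v,) + x[i + 1:]) for v in OMEGA]
        return sum(values) / len(OMEGA)
    return averaged


def influence_laplacian(h, i):
    """Inf_i[h] = <h, L_i h> with L_i h = h - E_i h."""
    Eih = E_i(h, i)
    return E(lambda x: h(x) * (h(x) - Eih(x)))


def influence_variance(h, i):
    """Inf_i[h] = E_x[ Var over a resampled coordinate i of h ]."""
    first = E_i(h, i)
    second = E_i(lambda x: h(x) ** 2, i)
    return E(lambda x: second(x) - first(x) ** 2)
```

For the function of Example 8.25, both `influence_laplacian(f, 0)` and `influence_variance(f, 0)` evaluate to `Fraction(2, 27)`.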


Next, we straightforwardly extend our definitions of the noise operator and 
noise stability to general product spaces. 


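Definition 8.26. Fix a finite product probability space $(\Omega^n, \pi^{\otimes n})$.
For $\rho \in [0, 1]$ and $x \in \Omega^n$ we write $y \sim N_\rho(x)$ to denote
that $y \in \Omega^n$ is randomly chosen as follows: For each $i \in [n]$
independently,

$$y_i = \begin{cases} x_i & \text{with probability } \rho, \\ \text{drawn from } \pi & \text{with probability } 1 - \rho. \end{cases}$$

If $x \sim \pi^{\otimes n}$ and $y \sim N_\rho(x)$, we say that $(x, y)$ is a
$\rho$-correlated pair under $\pi^{\otimes n}$. (This definition is symmetric in
$x$ and $y$.)

Definition 8.27. For a fixed space $L^2(\Omega^n, \pi^{\otimes n})$ and
$\rho \in [0, 1]$, the noise operator with parameter $\rho$ is the linear
operator $\mathrm{T}_\rho$ on functions $f \in L^2(\Omega^n, \pi^{\otimes n})$
defined by

$$\mathrm{T}_\rho f(x) = \mathop{\mathbf{E}}_{y \sim N_\rho(x)}[f(y)].$$

The noise stability of $f$ at $\rho$ is

$$\mathbf{Stab}_\rho[f] = \langle f, \mathrm{T}_\rho f \rangle = \mathop{\mathbf{E}}_{\substack{(x, y)\ \rho\text{-correlated} \\ \text{under } \pi^{\otimes n}}}[f(x) f(y)].$$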


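Proposition 8.28. Let $\rho \in [0, 1]$ and let $f \in L^2(\Omega^n, \pi^{\otimes n})$.
Then for any fixed product Fourier basis,

$$\mathrm{T}_\rho f = \sum_{\alpha \in \mathbb{N}^n_{<m}} \rho^{\#\alpha}\, \widehat{f}(\alpha)\, \phi_\alpha, \qquad \mathbf{Stab}_\rho[f] = \sum_{\alpha \in \mathbb{N}^n_{<m}} \rho^{\#\alpha}\, \widehat{f}(\alpha)^2.$$

Proof. Let $J$ denote a $\rho$-random subset of $[n]$; i.e., $J$ is formed by
including each $i \in [n]$ independently with probability $\rho$. Then by
definition $\mathrm{T}_\rho f(x) = \mathbf{E}_J[f^{\subseteq J}(x)]$, and so from
Proposition 8.19 we get

$$\mathrm{T}_\rho f = \mathop{\mathbf{E}}_J[f^{\subseteq J}] = \mathop{\mathbf{E}}_J\Big[\sum_{\substack{\alpha \in \mathbb{N}^n_{<m} \\ \mathrm{supp}(\alpha) \subseteq J}} \widehat{f}(\alpha)\, \phi_\alpha\Big] = \sum_{\alpha \in \mathbb{N}^n_{<m}} \rho^{\#\alpha}\, \widehat{f}(\alpha)\, \phi_\alpha,$$

since for a fixed $\alpha$, the probability of $\mathrm{supp}(\alpha) \subseteq J$
is $\rho^{\#\alpha}$. The formula for $\mathbf{Stab}_\rho[f]$ now follows from
Plancherel.

Remark 8.29. The first formula in this proposition may be used to extend the
definition of $\mathrm{T}_\rho f$ to values of $\rho$ outside $[0, 1]$.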
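The key step in the proof of Proposition 8.28 is that applying the noise operator is the same as averaging the projections onto a random subset $J$ that includes each coordinate independently with probability $\rho$. The sketch below checks this identity exactly on a small example; it assumes the uniform distribution on $\{a, b, c\}$, and the function names are hypothetical helpers written for this illustration.

```python
from fractions import Fraction
from itertools import combinations, product

OMEGA = "abc"
N = 2
PI = {w: Fraction(1, 3) for w in OMEGA}   # uniform distribution (assumption)
POINTS = list(product(OMEGA, repeat=N))


def f(x):
    """1 iff both inputs are c."""
    return Fraction(1) if x == ("c", "c") else Fraction(0)


def noise_operator(h, rho):
    """T_rho h(x) = E[h(y)], where each y_i keeps x_i w.p. rho, else is redrawn."""
    def Th(x):
        total = Fraction(0)
        for y in POINTS:
            weight = Fraction(1)
            for xi, yi in zip(x, y):
                keep = rho if xi == yi else Fraction(0)
                weight *= keep + (1 - rho) * PI[yi]
            total += weight * h(y)
        return total
    return Th


def projection(h, J):
    """Projection onto J: rerandomize (uniformly) the coordinates outside J."""
    def hJ(x):
        total = Fraction(0)
        for y in POINTS:
            z = tuple(x[i] if i in J else y[i] for i in range(N))
            total += h(z)
        return total / len(POINTS)
    return hJ


def average_over_random_subsets(h, rho):
    """E_J[projection of h onto J], each i placed in J independently w.p. rho."""
    def g(x):
        total = Fraction(0)
        for k in range(N + 1):
            for J in combinations(range(N), k):
                prob = rho ** k * (1 - rho) ** (N - k)
                total += prob * projection(h, set(J))(x)
        return total
    return g


rho = Fraction(1, 4)
lhs = noise_operator(f, rho)
rhs = average_over_random_subsets(f, rho)
```

With these definitions, `lhs(x)` and `rhs(x)` agree at every point `x` of `POINTS`.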


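We also define $\rho$-stable influences. The factor of $\rho^{-1}$ in our
definition is for consistency with the $L^2(\{-1, 1\}^n)$ case.

Definition 8.30. For $f \in L^2(\Omega^n, \pi^{\otimes n})$, $\rho \in (0, 1]$,
and $i \in [n]$, the $\rho$-stable influence of $i$ on $f$ is

$$\mathbf{Inf}_i^{(\rho)}[f] = \rho^{-1}\, \mathbf{Stab}_\rho[\mathrm{L}_i f] = \sum_{\alpha : \alpha_i \neq 0} \rho^{\#\alpha - 1}\, \widehat{f}(\alpha)^2.$$

We also define $\mathbf{I}^{(\rho)}[f] = \sum_{i=1}^n \mathbf{Inf}_i^{(\rho)}[f]$.

Just as in the case of $L^2(\{-1, 1\}^n)$ we can use stable influences to define
the “notable” coordinates of a function, of which there is a bounded quantity.
A verbatim repetition of the proof of Proposition 2.54 yields the following
generalization:

Proposition 8.31. Suppose $f \in L^2(\Omega^n, \pi^{\otimes n})$ has
$\mathbf{Var}[f] \le 1$. Given $0 < \delta < 1$, $0 < \epsilon \le 1$, let
$J = \{i \in [n] : \mathbf{Inf}_i^{(1-\delta)}[f] \ge \epsilon\}$. Then
$|J| \le \frac{1}{\delta\epsilon}$.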


We end this section by discussing the “degree” of functions on general product
spaces. For $f \in L^2(\{-1, 1\}^n)$ the Fourier expansion is a real polynomial;
this yields an obvious definition for degree. But for general
$f \in L^2(\Omega^n, \pi^{\otimes n})$ the domain is just an abstract set so we
need to look for a more intrinsic definition. We take our cue from
Exercise 1.10(b):

Definition 8.32. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be nonzero. The
degree of $f$, written $\deg(f)$, is the least $k \in \mathbb{N}$ such that $f$
is a sum of $k$-juntas (functions depending on at most $k$ coordinates).

Proposition 8.33. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be nonzero. Then for
any fixed product Fourier basis we have
$\deg(f) = \max\{\#\alpha : \widehat{f}(\alpha) \neq 0\}$.

Proof. The inequality $\deg(f) \le \max\{\#\alpha : \widehat{f}(\alpha) \neq 0\}$
is immediate from the Fourier expansion:

$$f = \sum_{\alpha : \widehat{f}(\alpha) \neq 0} \widehat{f}(\alpha)\, \phi_\alpha$$

and each function $\widehat{f}(\alpha)\, \phi_\alpha$ depends on at most
$\#\alpha$ coordinates. For the reverse inequality, suppose
$f = g_1 + \cdots + g_m$ where each $g_i$ depends on at most $k$ coordinates. By
Corollary 8.20 each $g_i$ has its Fourier support on functions $\phi_\alpha$
with $\#\alpha \le k$. But
$\widehat{f}(\alpha) = \widehat{g_1}(\alpha) + \cdots + \widehat{g_m}(\alpha)$,
so the same is true of $f$.

8.3. Orthogonal Decomposition


In this section we describe a basis-free kind of “Fourier expansion” for functions
on general product domains. We will refer to it as the orthogonal decomposition
of $f \in L^2(\Omega^n, \pi^{\otimes n})$, though it goes by several other names
in the literature: e.g., Hoeffding decomposition, Efron–Stein decomposition, or
ANOVA decomposition. The general idea is to express

$$f = \sum_{S \subseteq [n]} f^{=S}, \tag{8.1}$$

where each function $f^{=S} \in L^2(\Omega^n, \pi^{\otimes n})$ gives the
“contribution to $f$ coming from coordinates $S$ (but not from any subset of $S$)”.

To make this more precise, let’s start with the familiar case of
$f : \{-1, 1\}^n \to \mathbb{R}$. Here it is possible to define the functions
$f^{=S} : \{-1, 1\}^n \to \mathbb{R}$ simply by $f^{=S} = \widehat{f}(S)\, \chi_S$.
(Later we will give an equivalent definition that doesn’t involve the Fourier
basis.) This definition satisfies (8.1) as well as the following two properties:

(1) $f^{=S}$ depends only on the coordinates in $S$.
(2) If $T \subsetneq S$ and $g$ is a function depending only on the coordinates
in $T$, then $\langle f^{=S}, g \rangle = 0$.

These properties describe what we mean precisely when we say that $f^{=S}$ is the
“contribution to $f$ coming from coordinates $S$ (but not from any subset of $S$)”.
Furthermore, the decomposition (8.1) is orthogonal, meaning
$\langle f^{=S}, f^{=T} \rangle = 0$ whenever $S \neq T$.

To make this definition basis-free, recall the “projection of $f$ onto
coordinates $J$”, $f^{\subseteq J}$, from Exercise 3.28 and Definition 8.17. You
can think of $f^{\subseteq J}$ as the “contribution to $f$ coming from
coordinates $J$ (collectively)”. It has a probabilistic definition not depending
on any basis, and with the definition $f^{=S} = \widehat{f}(S)\, \chi_S$ we have
from Exercise 3.28 or Proposition 8.19 that

$$f^{\subseteq J} = \sum_{S \subseteq J} f^{=S}. \tag{8.2}$$

It is precisely by inverting (8.2) that we can give a basis-free definition of the
functions $f^{=S}$.

Let’s do this inversion for a general $f \in L^2(\Omega^n, \pi^{\otimes n})$. The
projection functions $f^{\subseteq J} \in L^2(\Omega^n, \pi^{\otimes n})$ can be
defined as in Definition 8.17. If we want (8.2) to hold for $J = \emptyset$ then
we should define

$$f^{=\emptyset} = f^{\subseteq \emptyset}$$

(which is the constant function equal to $\mathbf{E}[f]$). Given this, if we want
(8.2) to hold for singleton sets $J = \{j\}$, then we need

$$f^{\subseteq \{j\}} = f^{=\emptyset} + f^{=\{j\}} \iff f^{=\{j\}} = f^{\subseteq \{j\}} - f^{\subseteq \emptyset}.$$

In other words,

$$f^{=\{j\}}(x) = \mathop{\mathbf{E}}_{x' \sim \pi^{\otimes n}}[f(x') \mid x'_j = x_j] - \mathbf{E}[f].$$

Notice this function only depends on the input value $x_j$; it measures the
change in expectation of $f$ if you know the value $x_j$. Moving on to sets of
cardinality 2, if we want (8.2) to hold for $J = \{i, j\}$, then we need

$$f^{\subseteq \{i,j\}} = f^{=\emptyset} + f^{=\{i\}} + f^{=\{j\}} + f^{=\{i,j\}} = f^{\subseteq \emptyset} + (f^{\subseteq \{i\}} - f^{\subseteq \emptyset}) + (f^{\subseteq \{j\}} - f^{\subseteq \emptyset}) + f^{=\{i,j\}}$$

and hence

$$f^{=\{i,j\}} = f^{\subseteq \{i,j\}} - f^{\subseteq \{i\}} - f^{\subseteq \{j\}} + f^{\subseteq \emptyset}.$$

It’s clear that we can continue this and define all the functions $f^{=S}$ by the
principle of inclusion-exclusion. To show this definition leads to an orthogonal
decomposition we will need the following lemma:


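Lemma 8.34. Let $f, g \in L^2(\Omega^n, \pi^{\otimes n})$. Assume that $f$ does
not depend on any coordinate outside $I \subseteq [n]$, and $g$ does not depend
on any coordinate outside $J \subseteq [n]$. Then
$\langle f, g \rangle = \langle f^{\subseteq I \cap J}, g^{\subseteq I \cap J} \rangle$.

Proof. We may assume without loss of generality that $I \cup J = [n]$. Given any
$x \in \Omega^n$ we can break it into the parts
$(x_{I \cap J}, x_{I \setminus J}, x_{J \setminus I})$. We then have

$$\langle f, g \rangle = \mathop{\mathbf{E}}_{x_{I \cap J},\, x_{I \setminus J},\, x_{J \setminus I}}\big[f(x_{I \cap J}, x_{I \setminus J}) \cdot g(x_{I \cap J}, x_{J \setminus I})\big],$$

where we have abused notation slightly by writing $f$ and $g$ as functions just
of the coordinates on which they actually depend. Since $x_{I \setminus J}$ and
$x_{J \setminus I}$ are independent, the above equals

$$\mathop{\mathbf{E}}_{x_{I \cap J}}\Big[\mathop{\mathbf{E}}_{x_{I \setminus J}}[f(x_{I \cap J}, x_{I \setminus J})] \cdot \mathop{\mathbf{E}}_{x_{J \setminus I}}[g(x_{I \cap J}, x_{J \setminus I})]\Big].$$

But now $\mathbf{E}_{x_{I \setminus J}}[f(x_{I \cap J}, x_{I \setminus J})]$ is
nothing more than $f^{\subseteq I \cap J}(x_{I \cap J})$, and similarly
$\mathbf{E}_{x_{J \setminus I}}[g(x_{I \cap J}, x_{J \setminus I})] = g^{\subseteq I \cap J}(x_{I \cap J})$.
Thus the above equals

$$\mathop{\mathbf{E}}_{x_{I \cap J}}\big[f^{\subseteq I \cap J}(x_{I \cap J}) \cdot g^{\subseteq I \cap J}(x_{I \cap J})\big] = \langle f^{\subseteq I \cap J}, g^{\subseteq I \cap J} \rangle.$$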


We can now give the main theorem on orthogonal decomposition:

Theorem 8.35. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then $f$ has a unique
decomposition as

$$f = \sum_{S \subseteq [n]} f^{=S}$$

where the functions $f^{=S} \in L^2(\Omega^n, \pi^{\otimes n})$ satisfy the
following:

(1) $f^{=S}$ depends only on the coordinates in $S$.
(2) If $T \subsetneq S$ and $g \in L^2(\Omega^n, \pi^{\otimes n})$ depends only
on the coordinates in $T$, then $\langle f^{=S}, g \rangle = 0$.

This decomposition has the following additional properties:

(3) Condition (2) additionally holds whenever $S \not\subseteq T$.
(4) The decomposition is orthogonal: $\langle f^{=S}, f^{=T} \rangle = 0$ for
$S \neq T$.
(5) $\sum_{S \subseteq T} f^{=S} = f^{\subseteq T}$.
(6) For each $S \subseteq [n]$, the mapping $f \mapsto f^{=S}$ is a linear
operator.

Proof. We first show the existence of a decomposition satisfying (1)–(6). We
then show uniqueness for decompositions satisfying (1) and (2). As suggested
above, for each $S \subseteq [n]$ we define

$$f^{=S} = \sum_{J \subseteq S} (-1)^{|S| - |J|} f^{\subseteq J},$$

where the functions $f^{\subseteq J} \in L^2(\Omega^n, \pi^{\otimes n})$ are as in
Definition 8.17. Since each $f^{\subseteq J}$ depends only on the coordinates in
$J$, condition (1) certainly holds. It is also immediate that condition (5)
holds by inclusion-exclusion; you are asked to prove this explicitly in
Exercise 8.14. Condition (6) also follows because each $f \mapsto f^{\subseteq J}$
is a linear operator, as discussed after Definition 8.17.

We now verify (2). Assume $T \subsetneq S$ and that
$g \in L^2(\Omega^n, \pi^{\otimes n})$ only depends on the coordinates in $T$. We
have

$$\langle f^{=S}, g \rangle = \sum_{J \subseteq S} (-1)^{|S| - |J|} \langle f^{\subseteq J}, g \rangle. \tag{8.3}$$

Take any $i \in S \setminus T$ and pair up the summands in (8.3) as $J'$, $J''$,
where $J' \not\ni i$ and $J'' = J' \cup \{i\}$. By Lemma 8.34 we have

$$\langle f^{\subseteq J'}, g \rangle = \langle f^{\subseteq J' \cap T}, g^{\subseteq J' \cap T} \rangle = \langle f^{\subseteq J'' \cap T}, g^{\subseteq J'' \cap T} \rangle,$$

the latter equality using $i \notin T$; by Lemma 8.34 again, the right-hand side
equals $\langle f^{\subseteq J''}, g \rangle$. But the signs $(-1)^{|S| - |J'|}$
and $(-1)^{|S| - |J''|}$ are opposite, so the summands in (8.3) cancel in pairs.
This shows the sum is 0, confirming (2).

We complete the existence proof by noting that (2) $\Rightarrow$ (3)
$\Rightarrow$ (4) (assuming (1)). The first implication is because
$\langle f^{=S}, g \rangle = \langle f^{=S}, g^{\subseteq S \cap T} \rangle$ when
$g$ depends only on the coordinates in $T$ (Lemma 8.34), and
$S \cap T \subsetneq S$ when $S \not\subseteq T$. The second implication is
because $S \neq T$ implies either $S \not\subseteq T$ or $T \not\subseteq S$.

It remains to prove the uniqueness statement. Suppose $f$ has two
representations satisfying (1) and (2). By subtracting them we get a
decomposition of the 0 function that satisfies (1) and (2); our goal is to show
that each function in this decomposition is the 0 function. We can do this by
showing that any decomposition satisfying (1) and (2) also satisfies
“Parseval’s Theorem”:
$\langle f, f \rangle = \sum_{S \subseteq [n]} \|f^{=S}\|_2^2$. But this is an
easy consequence of (4), which we just noted is itself a consequence of (1)
and (2).


We can connect the orthogonal decomposition of $f$ to its expansion under
Fourier bases as follows:

Proposition 8.36. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have orthogonal
decomposition $f = \sum_{S \subseteq [n]} f^{=S}$. Fix any Fourier basis
$\phi_0, \dots, \phi_{m-1}$ for $L^2(\Omega, \pi)$. Then

$$f^{=S} = \sum_{\substack{\alpha \in \mathbb{N}^n_{<m} \\ \mathrm{supp}(\alpha) = S}} \widehat{f}(\alpha)\, \phi_\alpha. \tag{8.4}$$

Proof. This follows easily from the uniqueness part of Theorem 8.35. If we
take (8.4) as the definition of functions $f^{=S}$, it is immediate that
$\sum_S f^{=S} = f$ and that $f^{=S}$ depends only on the coordinates in $S$.
Further, if $g$ depends only on coordinates $T \subsetneq S$, then $f^{=S}$ and
$g$ have disjoint Fourier support by Corollary 8.20; hence
$\langle f^{=S}, g \rangle = 0$ by Plancherel (Proposition 8.16).


Example 8.37. Let’s compute the orthogonal decomposition of the function
$f : \{a, b, c\}^2 \to \{0, 1\}$ from Example 8.15. Recall that in this example
$\{a, b, c\}$ has the uniform distribution and $f(x_1, x_2) = 1$ if and only if
$x_1 = x_2 = c$. First,

$$f^{=\emptyset} = \mathbf{E}[f] = \tfrac{1}{9}.$$

Next, for $i = 1, 2$ we have that $f^{\subseteq \{i\}}$ is $\tfrac{1}{3}$ if
$x_i = c$ and 0 otherwise; hence

$$f^{=\{i\}}(x_1, x_2) = \begin{cases} +\tfrac{2}{9} & \text{if } x_i = c, \\ -\tfrac{1}{9} & \text{else.} \end{cases}$$

Finally, it’s easiest to compute $f^{=\{1,2\}}$ as
$f - f^{=\emptyset} - f^{=\{1\}} - f^{=\{2\}}$; this yields

$$f^{=\{1,2\}}(x_1, x_2) = \begin{cases} +\tfrac{4}{9} & \text{if } x_1 = x_2 = c, \\ -\tfrac{2}{9} & \text{if exactly one of } x_1, x_2 \text{ is } c, \\ +\tfrac{1}{9} & \text{if } x_1, x_2 \neq c. \end{cases}$$

You can check (Exercise 8.20) that this is consistent with Proposition 8.36 and
the Fourier expansion from Example 8.15.
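The inclusion-exclusion definition of $f^{=S}$ from the proof of Theorem 8.35 is easy to run directly. The sketch below recomputes the decomposition of Example 8.37 with exact fractions; it assumes the uniform distribution on $\{a, b, c\}$, uses 0-indexed coordinates, and the helper names (`projection`, `part`, `inner`) are our own.

```python
from fractions import Fraction
from itertools import combinations, product

OMEGA = "abc"                        # uniform distribution assumed
N = 2
POINTS = list(product(OMEGA, repeat=N))


def f(x):
    """1 iff both inputs are c."""
    return Fraction(1) if x == ("c", "c") else Fraction(0)


def projection(h, J):
    """Projection onto J: average h over uniform values outside J."""
    def hJ(x):
        total = Fraction(0)
        for y in POINTS:
            z = tuple(x[i] if i in J else y[i] for i in range(N))
            total += h(z)
        return total / len(POINTS)
    return hJ


def part(h, S):
    """The S-part of h: sum over J subset of S of (-1)^(|S|-|J|) * projection."""
    S = tuple(S)

    def hS(x):
        total = Fraction(0)
        for k in range(len(S) + 1):
            for J in combinations(S, k):
                sign = (-1) ** (len(S) - k)
                total += sign * projection(h, set(J))(x)
        return total
    return hS


def inner(g1, g2):
    """<g1, g2> under the uniform product distribution."""
    return sum(g1(x) * g2(x) for x in POINTS) / len(POINTS)


subsets = [(), (0,), (1,), (0, 1)]
parts = {S: part(f, S) for S in subsets}
```

Summing `parts[S](x)` over all four subsets recovers `f(x)`, and distinct parts have inner product zero, as property (4) of the theorem predicts.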


We can write all of the Fourier formulas from Section 8.2 in terms of the
orthogonal decomposition; e.g.,

$$\langle f, g \rangle = \sum_{S \subseteq [n]} \langle f^{=S}, g^{=S} \rangle, \qquad \mathbf{Inf}_i[f] = \sum_{S \ni i} \|f^{=S}\|_2^2, \qquad \mathrm{T}_\rho f = \sum_{S \subseteq [n]} \rho^{|S|} f^{=S}.$$

These formulas can be proved either by using the connection from
Proposition 8.36 or by reasoning directly from the defining Theorem 8.35; see
Exercise 8.18. The orthogonal decomposition also gives us the natural way
of stratifying $f$ by degree; we end this section by generalizing some more
definitions from Chapter 1.4:

Definition 8.38. For $f \in L^2(\Omega^n, \pi^{\otimes n})$ and
$k \in \mathbb{N}$ we define the degree $k$ part of $f$ to be
$f^{=k} = \sum_{|S| = k} f^{=S}$ and the weight of $f$ at degree $k$ to be
$\mathbf{W}^{k}[f] = \|f^{=k}\|_2^2$. We also use notation like
$f^{\le k} = \sum_{|S| \le k} f^{=S}$ and
$\mathbf{W}^{>k}[f] = \sum_{|S| > k} \|f^{=S}\|_2^2$.

8.4. p-Biased Analysis 


Perhaps the most common generalized domain in analysis of Boolean functions
is the case of the hypercube with “biased” bits. In this setting we think of a
random input in $\{-1, 1\}^n$ as having each bit independently equal to $-1$
(True) with probability $p \in (0, 1)$ and equal to $1$ (False) with probability
$q = 1 - p$. (We could also consider different parameters $p_i$ for each
coordinate; see Exercise 8.24.) In the notation of the chapter this means
$L^2(\Omega^n, \pi_p^{\otimes n})$, where $\Omega = \{-1, 1\}$ and $\pi_p$ is the
distribution on $\Omega$ defined by $\pi_p(-1) = p$, $\pi_p(1) = q$. This context
is often referred to as p-biased Fourier analysis, though it would be more
consistent with our terminology if it were called “$\mu$-biased”, where

$$\mu = \mathop{\mathbf{E}}_{x_i \sim \pi_p}[x_i] = q - p = 1 - 2p.$$

One of the more interesting features of the setting is that we can fix a
combinatorial Boolean function $f : \{-1, 1\}^n \to \{-1, 1\}$ and then consider
its properties for various $p$ between 0 and 1; we will discuss this further
later in this section. We will also sometimes use the abbreviated notation
$\mathbf{Pr}_{\pi_p}[\cdot]$ in place of
$\mathbf{Pr}_{x \sim \pi_p^{\otimes n}}[\cdot]$, and similarly
$\mathbf{E}_{\pi_p}[\cdot]$.

The p-biased hypercube is one of the generalized domains where it can pay
to look at an explicit Fourier basis. In fact, since we have $|\Omega| = 2$
there is a unique Fourier basis $\{\phi_0, \phi_1\}$ (up to negating $\phi_1$).
For notational simplicity we’ll write $\phi$ instead of $\phi_1$ and use “set
notation” rather than multi-index notation:

Definition 8.39. In the context of p-biased Fourier analysis we define the basis
function $\phi : \{-1, 1\} \to \mathbb{R}$ by

$$\phi(x_i) = \frac{x_i - \mu}{\sigma},$$

where

$$\mu = \mathop{\mathbf{E}}_{x_i \sim \pi_p}[x_i] = q - p = 1 - 2p, \qquad \sigma = \mathop{\mathbf{stddev}}_{x_i \sim \pi_p}[x_i] = \sqrt{4pq} = 2\sqrt{p}\sqrt{1 - p}.$$

Note that $\sigma^2 = 1 - \mu^2$. We also have the formulas
$\phi(1) = \sqrt{p/q}$, $\phi(-1) = -\sqrt{q/p}$.

We will use the notation $\mu$ and $\sigma$ throughout this section. It’s clear
that $\{1, \phi\}$ is indeed a Fourier basis for $L^2(\{-1, 1\}, \pi_p)$ because
$\mathbf{E}[\phi(x_i)] = 0$ and $\mathbf{E}[\phi(x_i)^2] = 1$ by design.


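Definition 8.40. In the context of $L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ we define
the product Fourier basis functions $(\phi_S)_{S \subseteq [n]}$ by

$$\phi_S(x) = \prod_{i \in S} \phi(x_i).$$

Given $f \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ we write $\widehat{f}(S)$ for
the associated Fourier coefficient; i.e.,

$$\widehat{f}(S) = \mathop{\mathbf{E}}_{x \sim \pi_p^{\otimes n}}[f(x)\, \phi_S(x)].$$

Thus we have the biased Fourier expansion

$$f(x) = \sum_{S \subseteq [n]} \widehat{f}(S)\, \phi_S(x).$$

Although the notation is very similar to that of the classic uniform-distribution
Fourier analysis, we caution that in general,

$$\phi_S \phi_T \neq \phi_{S \triangle T}.$$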


Example 8.41. Let $\chi_i \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ be the $i$th
dictator function, $\chi_i(x) = x_i$, viewed under the p-biased distribution. We
have

$$\phi(x_i) = \frac{x_i - \mu}{\sigma} \implies x_i = \mu + \sigma\, \phi(x_i),$$

and the latter is evidently $\chi_i$’s (biased) Fourier expansion. That is,

$$\widehat{\chi_i}(\emptyset) = \mu, \qquad \widehat{\chi_i}(\{i\}) = \sigma, \qquad \widehat{\chi_i}(S) = 0 \text{ otherwise.}$$

This example lets us see a link between a function’s “usual” Fourier expansion
and its biased Fourier expansion. (For more on this, see Exercise 8.25.)
Let’s abuse notation a little by writing simply $\phi_i$ instead of
$\phi(x_i)$. We have the formulas

$$\phi_i = \frac{x_i - \mu}{\sigma} \iff x_i = \mu + \sigma \phi_i, \tag{8.5}$$

and we can go from the usual Fourier expansion to the biased Fourier expansion
simply by plugging in the latter.


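Example 8.42. Recall the “selection function” $\mathrm{Sel} : \{-1, 1\}^3 \to \{-1, 1\}$
from Exercise 1.1(7); $\mathrm{Sel}(x_1, x_2, x_3)$ outputs $x_2$ if $x_1 = -1$
and outputs $x_3$ if $x_1 = 1$. The usual Fourier expansion of Sel is

$$\mathrm{Sel}(x_1, x_2, x_3) = \tfrac{1}{2} x_2 + \tfrac{1}{2} x_3 - \tfrac{1}{2} x_1 x_2 + \tfrac{1}{2} x_1 x_3.$$

Using the substitution from (8.5) we get

$$\mathrm{Sel}(x_1, x_2, x_3) = \tfrac{1}{2}(\mu + \sigma\phi_2) + \tfrac{1}{2}(\mu + \sigma\phi_3) - \tfrac{1}{2}(\mu + \sigma\phi_1)(\mu + \sigma\phi_2) + \tfrac{1}{2}(\mu + \sigma\phi_1)(\mu + \sigma\phi_3)$$

$$= \mu + (\tfrac{1}{2}\sigma - \tfrac{1}{2}\mu\sigma)\, \phi_2 + (\tfrac{1}{2}\sigma + \tfrac{1}{2}\mu\sigma)\, \phi_3 - \tfrac{1}{2}\sigma^2\, \phi_1\phi_2 + \tfrac{1}{2}\sigma^2\, \phi_1\phi_3. \tag{8.6}$$

Thus if we write $\mathrm{Sel}^{(p)}$ for the selection function thought of as an
element of $L^2(\{-1, 1\}^3, \pi_p^{\otimes 3})$, we have

$$\widehat{\mathrm{Sel}^{(p)}}(\emptyset) = \mu, \quad \widehat{\mathrm{Sel}^{(p)}}(\{2\}) = \tfrac{1}{2}\sigma - \tfrac{1}{2}\mu\sigma, \quad \widehat{\mathrm{Sel}^{(p)}}(\{3\}) = \tfrac{1}{2}\sigma + \tfrac{1}{2}\mu\sigma,$$

$$\widehat{\mathrm{Sel}^{(p)}}(\{1, 2\}) = -\tfrac{1}{2}\sigma^2, \quad \widehat{\mathrm{Sel}^{(p)}}(\{1, 3\}) = \tfrac{1}{2}\sigma^2, \quad \widehat{\mathrm{Sel}^{(p)}}(S) = 0 \text{ else.}$$

By the Fourier formulas of Section 8.2 we can deduce, e.g., that
$\mathbf{E}[\mathrm{Sel}^{(p)}] = \mu$,
$\mathbf{Inf}_1[\mathrm{Sel}^{(p)}] = (-\tfrac{1}{2}\sigma^2)^2 + (\tfrac{1}{2}\sigma^2)^2 = \tfrac{1}{2}\sigma^4$,
etc.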
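The biased coefficients in Example 8.42 can be cross-checked by computing $\widehat{f}(S) = \mathbf{E}[f \phi_S]$ directly from Definition 8.40. The following sketch does this by enumeration in floating point; the function name `biased_fourier`, the choice $p = 0.3$, and the use of 1-indexed tuples for the sets $S$ are our own conventions for the illustration.

```python
import math
from itertools import combinations, product


def biased_fourier(f, n, p):
    """Return (mu, sigma, coeffs), where coeffs[S] = E_{pi_p}[f(x) * phi_S(x)].

    Each bit is -1 with probability p and +1 with probability q = 1 - p;
    S is a sorted tuple of 1-indexed coordinates.
    """
    q = 1 - p
    mu = q - p
    sigma = 2 * math.sqrt(p * q)

    def phi(b):
        return (b - mu) / sigma

    coeffs = {}
    for k in range(n + 1):
        for S in combinations(range(1, n + 1), k):
            total = 0.0
            for x in product((-1, 1), repeat=n):
                weight = math.prod(p if b == -1 else q for b in x)
                total += weight * f(x) * math.prod(phi(x[i - 1]) for i in S)
            coeffs[S] = total
    return mu, sigma, coeffs


def sel(x):
    """Selection function: output x2 if x1 = -1, else x3."""
    return x[1] if x[0] == -1 else x[2]


mu, sigma, coeffs = biased_fourier(sel, 3, 0.3)
inf_1 = sum(c * c for S, c in coeffs.items() if 1 in S)
```

Here `coeffs[(1, 2)]` should match $-\tfrac{1}{2}\sigma^2$ and `inf_1` should match $\tfrac{1}{2}\sigma^4$, up to rounding.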


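Let’s codify a piece of notation from this example:

Notation 8.43. Let $f : \{-1, 1\}^n \to \mathbb{R}$ and let $p \in (0, 1)$. We
write $f^{(p)}$ for the function when viewed as an element of
$L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$.

We now discuss derivative operators. We would like to define an operator
$\mathrm{D}_{\phi_i}$ on $L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ that acts like
differentiation on the biased Fourier expansion. For example, referring to (8.6)
we would like to have

$$\mathrm{D}_{\phi_3} \mathrm{Sel}^{(p)} = (\tfrac{1}{2}\sigma + \tfrac{1}{2}\mu\sigma) + \tfrac{1}{2}\sigma^2\, \phi_1.$$

In general we are seeking $\frac{\partial}{\partial \phi_i}$ which, by basic
calculus and the relationship (8.5), satisfies

$$\frac{\partial}{\partial \phi_i} = \frac{\partial x_i}{\partial \phi_i} \cdot \frac{\partial}{\partial x_i} = \sigma \cdot \frac{\partial}{\partial x_i}.$$

Recognizing $\frac{\partial}{\partial x_i}$ as the “usual” $i$th derivative
operator, we are led to the following:

Definition 8.44. For $i \in [n]$, the $i$th (discrete) derivative operator
$\mathrm{D}_{\phi_i}$ on $L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ is defined by

$$\mathrm{D}_{\phi_i} f(x) = \sigma \cdot \frac{f(x^{(i \mapsto 1)}) - f(x^{(i \mapsto -1)})}{2}.$$

Note that this defines a different operator for each value of $p$. We sometimes
write the above definition as

$$\mathrm{D}_{\phi_i} = \sigma \cdot \mathrm{D}_{x_i}.$$

With respect to the biased Fourier expansion of
$f \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ the operator $\mathrm{D}_{\phi_i}$
satisfies

$$\mathrm{D}_{\phi_i} f = \sum_{S \ni i} \widehat{f}(S)\, \phi_{S \setminus \{i\}}. \tag{8.7}$$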
Given this definition we can derive some additional formulas for influences, 
including a generalization of Proposition 2.21: 


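Proposition 8.45. Suppose $f \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ is
Boolean-valued (i.e., has range $\{-1, 1\}$). Then

$$\mathbf{Inf}_i[f] = \sigma^2 \mathop{\mathbf{Pr}}_{x \sim \pi_p^{\otimes n}}[f(x) \neq f(x^{\oplus i})]$$

for each $i \in [n]$, and

$$\mathbf{I}[f] = \sigma^2 \mathop{\mathbf{E}}_{x \sim \pi_p^{\otimes n}}[\mathrm{sens}_f(x)].$$

If furthermore $f$ is monotone, then $\mathbf{Inf}_i[f] = \sigma \widehat{f}(i)$.

Proof. Using Definition 8.44’s notation and (8.7) we have

$$\mathbf{Inf}_i[f] = \mathbf{E}[(\mathrm{D}_{\phi_i} f)^2] = \sigma^2\, \mathbf{E}[(\mathrm{D}_{x_i} f)^2].$$

Since $(\mathrm{D}_{x_i} f)^2$ is the 0-1 indicator that $i$ is pivotal for $f$,
the first formula follows. The second formula follows by summing over $i$.
Finally, when $f$ is monotone we furthermore have that
$(\mathrm{D}_{x_i} f)^2 = \mathrm{D}_{x_i} f$ and hence

$$\mathbf{Inf}_i[f] = \sigma^2\, \mathbf{E}[\mathrm{D}_{x_i} f] = \sigma\, \mathbf{E}[\mathrm{D}_{\phi_i} f] = \sigma \widehat{f}(i),$$

as claimed.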


The remainder of this section is devoted to the topic of threshold phenomena
in Boolean functions. Much of the motivation for this comes from the theory of
random graphs, which we now briefly introduce.

Definition 8.46. Given an undirected graph $G$ on $v \ge 2$ vertices, we identify
it with the string in $\{\text{True}, \text{False}\}^{\binom{v}{2}}$ which
indicates which edges are present (True) and which are absent (False). We write
$\mathcal{G}(v, p)$ for the distribution $\pi_p^{\otimes \binom{v}{2}}$; this is
called the Erdős–Rényi random graph model. Note that if we permute the $v$
vertices of a graph, this induces a permutation on the $\binom{v}{2}$ edges. A
($v$-vertex) graph property is a Boolean function
$f : \{\text{True}, \text{False}\}^{\binom{v}{2}} \to \{\text{True}, \text{False}\}$
that is invariant under all $v!$ such permutations of its input; colloquially,
this means that $f$ “does not depend on the names of the vertices”.

Graph properties are always transitive-symmetric functions in the sense of
Definition 2.10.

Example 8.47. The following are all $v$-vertex graph properties:
$\mathrm{Conn}(G)$ = True if $G$ is connected;
$\mathrm{3Col}(G)$ = True if $G$ is 3-colorable;
$\mathrm{Clique}_k(G)$ = True if $G$ contains a clique on at least $k$ vertices;
$\mathrm{Maj}_n(G)$ = True (assuming $n = \binom{v}{2}$ is odd) if $G$ has at least $\binom{v}{2}/2$ edges;
$\chi_{[n]}(G)$ = True if $G$ has an odd number of edges.

Note that each of these actually defines a family of Boolean functions, one for
each value of $v$; this is the typical situation in the study of graph
properties. An example of a function
$f : \{\text{True}, \text{False}\}^{\binom{v}{2}} \to \{\text{True}, \text{False}\}$
that is not a graph property is the one defined by $f(G)$ = True if vertex #1
has at least one neighbor; this $f$ is not invariant under permuting the
vertices.


Graph properties which are monotone are particularly nice to study; these
are the ones for which adding edges can never make the property go from True
to False. The properties Conn, $\mathrm{Clique}_k$, and $\mathrm{Maj}_n$ defined
above are all monotone, as is $\neg$3Col. Now suppose we take a monotone graph
property, say, Conn. A typical question in random graph theory would be, “how
many edges does a graph need to have before it is likely to be connected?” Or
more precisely, how does
$\mathbf{Pr}_{G \sim \mathcal{G}(v, p)}[\mathrm{Conn}(G) = \text{True}]$ vary as
$p$ increases from 0 to 1?

There’s no need to ask this question just for graph properties. Given any
monotone Boolean function
$f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$ it is
intuitively clear that when $p$ increases from 0 to 1 this causes
$\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ to increase from 0 to 1 (unless $f$ is a
constant function). As illustration, we show a plot of
$\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ versus $p$ for the dictator function,
$\mathrm{AND}_2$, and $\mathrm{Maj}_{101}$.

Figure 8.1. Plot of $\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ versus $p$ for $f$
a dictator (dotted), $f = \mathrm{AND}_2$ (dashed), and
$f = \mathrm{Maj}_{101}$ (solid).

The Margulis–Russo Formula quantifies the rate at which
$\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ increases with $p$; specifically, it
relates the slope of the curve at $p$ to the total influence of $f$ under
$\pi_p^{\otimes n}$. To prove the formula we switch to $\pm 1$ notation.


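Margulis–Russo Formula. Let $f : \{-1, 1\}^n \to \mathbb{R}$. Recalling
Notation 8.43 and the relation $\mu = 1 - 2p$, we have

$$\frac{d}{d\mu} \mathbf{E}[f^{(p)}] = \frac{1}{\sigma} \cdot \sum_{i=1}^n \widehat{f^{(p)}}(i). \tag{8.8}$$

In particular, if $f : \{-1, 1\}^n \to \{-1, 1\}$ is monotone, then

$$\frac{d}{dp} \mathop{\mathbf{Pr}}_{x \sim \pi_p^{\otimes n}}[f(x) = -1] = \frac{d}{d\mu} \mathbf{E}[f^{(p)}] = \frac{1}{\sigma^2} \cdot \mathbf{I}[f^{(p)}]. \tag{8.9}$$

Proof. Treating $f$ as a multilinear polynomial over $x_1, \dots, x_n$ we have

$$\mathbf{E}[f^{(p)}] = \mathrm{T}_\mu f(1, \dots, 1) = f(\mu, \dots, \mu)$$

(this also follows from Exercise 1.4). By basic calculus,

$$\frac{d}{d\mu} f(\mu, \dots, \mu) = \sum_{i=1}^n \mathrm{D}_{x_i} f(\mu, \dots, \mu).$$

But

$$\mathrm{D}_{x_i} f(\mu, \dots, \mu) = \mathbf{E}[\mathrm{D}_{x_i} f^{(p)}] = \frac{1}{\sigma}\, \mathbf{E}[\mathrm{D}_{\phi_i} f^{(p)}] = \frac{1}{\sigma}\, \widehat{f^{(p)}}(i),$$

completing the proof of (8.8). As for (8.9), the second equality follows
immediately from Proposition 8.45. The first equality holds because
$\mu = 1 - 2p$ and $\mathbf{E}[f] = 1 - 2\, \mathbf{Pr}[f = -1]$; the two factors
of $-2$ cancel.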
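As a sanity check on the Margulis–Russo Formula, one can compare a finite-difference slope of $\mathbf{Pr}_{\pi_p}[f = -1]$ with $\mathbf{I}[f^{(p)}]/\sigma^2$ for a small monotone function. The sketch below does this for the majority of 3 bits; the function names, the central-difference step `h`, and the choice $p = 0.3$ are assumptions made for the illustration, and the total influence is computed from the biased Fourier coefficients via $\sum_S |S| \widehat{f}(S)^2$.

```python
import math
from itertools import combinations, product


def prob_minus_one(f, n, p):
    """Pr[f(x) = -1] when each bit is -1 (True) independently with probability p."""
    return sum(
        math.prod(p if b == -1 else 1 - p for b in x)
        for x in product((-1, 1), repeat=n)
        if f(x) == -1
    )


def total_influence(f, n, p):
    """I[f^(p)] = sum over S of |S| * fhat(S)^2 in the p-biased Fourier basis."""
    q = 1 - p
    mu = q - p
    sigma = 2 * math.sqrt(p * q)
    total = 0.0
    for k in range(1, n + 1):
        for S in combinations(range(n), k):
            coeff = sum(
                math.prod(p if b == -1 else q for b in x)
                * f(x)
                * math.prod((x[i] - mu) / sigma for i in S)
                for x in product((-1, 1), repeat=n)
            )
            total += k * coeff ** 2
    return total


def maj3(x):
    """Majority of three +/-1 bits (monotone)."""
    return 1 if sum(x) > 0 else -1


p, h = 0.3, 1e-6
slope = (prob_minus_one(maj3, 3, p + h) - prob_minus_one(maj3, 3, p - h)) / (2 * h)
predicted = total_influence(maj3, 3, p) / (4 * p * (1 - p))   # I[f^(p)] / sigma^2
```

For this function $\mathbf{Pr}[f = -1] = 3p^2 - 2p^3$, so both `slope` and `predicted` should be close to $6p(1-p)$.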


Remark 8.48. If $f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$
is a nonconstant monotone function, the Margulis–Russo Formula implies that
$\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ is a strictly increasing function of
$p$, because $\mathbf{I}[f^{(p)}]$ is always positive.

Looking again at Figure 8.1 we see that the plot for $\mathrm{Maj}_{101}$ looks
very much like a step function, jumping from nearly 0 to nearly 1 around the
critical value $p = 1/2$. For $\mathrm{Maj}_n$, this “sharp threshold at
$p = 1/2$” becomes more and more pronounced as $n$ increases. This is clearly
suggested by the Margulis–Russo Formula: the derivative of the curve at
$p = 1/2$ is equal to $\mathbf{I}[\mathrm{Maj}_n]$ (the usual,
uniform-distribution total influence), which has the very large value
$\Theta(\sqrt{n})$ (Theorem 2.33). Such sharp thresholds exist for many Boolean
functions; we give some examples:


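Example 8.49. In Exercise 8.23 you are asked to show that for every
$\epsilon > 0$ there is a $C$ such that

$$\mathop{\mathbf{Pr}}_{\pi_{1/2 - C/\sqrt{n}}}[\mathrm{Maj}_n = \text{True}] \le \epsilon, \qquad \mathop{\mathbf{Pr}}_{\pi_{1/2 + C/\sqrt{n}}}[\mathrm{Maj}_n = \text{True}] \ge 1 - \epsilon.$$

Regarding the Erdős–Rényi graph model, the following facts are known:

$$\mathop{\mathbf{Pr}}_{G \sim \mathcal{G}(v, p)}[\mathrm{Clique}_{\log v}(G) = \text{True}] \xrightarrow[v \to \infty]{} \begin{cases} 0 & \text{if } p < 1/4, \\ 1 & \text{if } p > 1/4. \end{cases}$$

$$\mathop{\mathbf{Pr}}_{G \sim \mathcal{G}(v, p)}[\mathrm{Conn}(G) = \text{True}] \xrightarrow[v \to \infty]{} \begin{cases} 0 & \text{if } p < \frac{\ln v}{v}\big(1 - \frac{\log\log v}{\log v}\big), \\ 1 & \text{if } p > \frac{\ln v}{v}\big(1 + \frac{\log\log v}{\log v}\big). \end{cases}$$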

In the above examples you can see that the “jump” occurs at various values
of $p$. To investigate this phenomenon, we first single out the value for which
$\mathbf{Pr}_{\pi_p}[f(x) = \text{True}] = 1/2$:

Definition 8.50. Let $f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$
be monotone and nonconstant. The critical probability for $f$, denoted $p_c$, is
the unique value in $(0, 1)$ for which
$\mathbf{Pr}_{x \sim \pi_{p_c}^{\otimes n}}[f(x) = \text{True}] = 1/2$. We also
write $q_c = 1 - p_c$, $\mu_c = q_c - p_c = 1 - 2p_c$, and
$\sigma_c = \sqrt{4 p_c q_c}$.


In Exercise 8.27 you are asked to verify that $p_c$ is well defined.

Looking at the connectivity property from Example 8.49 we see that not
only does $\mathbf{Pr}_{\pi_p}[\mathrm{Conn} = \text{True}]$ jump from near 0 to
near 1 in an interval of the form $p_c \pm o(1)$, it actually makes the jump in
an interval of the form $p_c(1 \pm o(1))$. This latter phenomenon is (roughly
speaking) what is meant by a “sharp threshold”. To investigate this further,
suppose that $f$ is a (nonconstant) monotone function and $\Delta$ is the
derivative of $\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ at $p = p_c$.
Intuitively, we would expect $\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ to jump
from near 0 to near 1 in an interval around $p_c$ of width about $1/\Delta$.
Thus a “sharp threshold” should roughly correspond to the case that $1/\Delta$
is small even compared to $\min(p_c, q_c)$. The Margulis–Russo Formula says that
$\Delta = \frac{1}{\sigma_c^2} \mathbf{I}[f^{(p_c)}]$, and since
$\min(p_c, q_c)$ is proportional to $4 p_c q_c = \sigma_c^2$, it follows that
$1/\Delta$ is “small” compared to $\min(p_c, q_c)$ if and only if
$\mathbf{I}[f^{(p_c)}]$ is “large”. Thus we have a neat criterion:

Sharp threshold principle: Let
$f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$ be
monotone. Then, roughly speaking, $\mathbf{Pr}_{\pi_p}[f(x) = \text{True}]$ has a
“sharp threshold” if and only if $f$ has “large” (“superconstant”) total
influence under its critical probability distribution.

Of course this should all be made a bit more precise; see Exercise 8.28 for
details. In light of this principle, we may try to prove that a given $f$ has a
sharp threshold by proving that $\mathbf{I}[f^{(p_c)}]$ is not “small”. This
strongly motivates the problem of “characterizing” functions
$f \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ for which $\mathbf{I}[f]$ is small.
Friedgut’s Junta Theorem, mentioned at the end of Chapter 3.1 and proved in
Chapter 9.6, tells us that in the uniform distribution case $p = 1/2$, the only
way $\mathbf{I}[f]$ can be small is if $f$ is close to a junta. In particular,
any monotone graph property with $p_c = 1/2$ must have a very large derivative
$\frac{d}{dp} \mathbf{Pr}_{\pi_p}[f = \text{True}]$ at $p = p_c$: since the
function is transitive-symmetric, all $n$ coordinates are equally influential
and it can’t be close to a junta. These results also hold so long as $p$ is
bounded away from 0 and 1; see Chapter 10.3. However, many interesting monotone
graph properties have $p_c$ very close to 0: e.g., connectivity, as we saw in
Example 8.49. Characterizing the functions
$f \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ with small $\mathbf{I}[f]$ when
$p = o_n(1)$ is a trickier task; see the work of Friedgut, Bourgain, and Hatami
described in Chapter 10.5.


8.5. Abelian Groups 


The previous section covered the case of $f \in L^2(\Omega^n, \pi^{\otimes n})$
with $|\Omega| = 2$; there, we saw it could be helpful to look at explicit
Fourier bases. When $|\Omega| \ge 3$ this is often not helpful, especially if the
only “operation” on the domain is equality. For example, if
$f : \{\text{Red}, \text{Green}, \text{Blue}\}^n \to \mathbb{R}$, then it’s best
to just work abstractly with the orthogonal decomposition. However, if there is
a notion of, say, “addition” in $\Omega$, then there is a natural, canonical
Fourier basis for $L^2(\Omega, \pi)$ when $\pi$ is the uniform distribution.

More precisely, suppose the domain $\Omega$ is a finite abelian group $G$, with
operation $+$ and identity 0. We will consider the domain $G$ under the uniform
probability distribution $\pi$; this is quite natural because $\pi$ is
translation-invariant: $\pi(X) = \pi(t + X)$ for any $X \subseteq G$, $t \in G$.
In this setting it is more convenient to allow functions with range the complex
numbers; thus we come to the following definition:

Definition 8.51. Let $G$ be a finite abelian group with operation $+$ and
identity 0. For $n \in \mathbb{N}^+$ we write $L^2(G^n)$ for the complex inner
product space of functions $f : G^n \to \mathbb{C}$, with inner product

$$\langle f, g \rangle = \mathop{\mathbf{E}}_{x \sim G^n}\big[f(x)\, \overline{g(x)}\big].$$

Here and throughout this section $x \sim G^n$ denotes that $x$ is drawn from the
uniform distribution on $G^n$.

Everything we have done in this chapter for the real inner product space
$L^2(\Omega^n, \pi^{\otimes n})$ generalizes easily to the case of a complex
inner product; the main difference is that Plancherel’s Theorem becomes

$$\langle f, g \rangle = \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha)\, \overline{\widehat{g}(\alpha)} = \sum_{S \subseteq [n]} \langle f^{=S}, g^{=S} \rangle.$$

See Exercise 8.32 for more.

A natural Fourier basis for $L^2(G)$ comes from a natural family of functions
$G \to \mathbb{C}$, namely the characters. These are defined to be the group
homomorphisms from $G$ to $\mathbb{C}^\times$, where $\mathbb{C}^\times$ is the
abelian group of nonzero complex numbers under multiplication.

Definition 8.52. A character of the (finite) group $G$ is a function
$\chi : G \to \mathbb{C}^\times$ which is a homomorphism; i.e., satisfies
$\chi(x + y) = \chi(x)\chi(y)$. Since $G$ is finite there is some
$m \in \mathbb{N}^+$ such that $0 = x + x + \cdots + x$ ($m$ times) for each
$x \in G$. Thus $1 = \chi(0) = \chi(x)^m$, meaning the range of $\chi$ is in
fact contained in the $m$th roots of unity. In particular, $|\chi(x)| = 1$ for
all $x \in G$.


We have the following easy facts: 


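Fact 8.53. If $\chi$ and $\phi$ are characters of $G$, then so are $\overline{\chi}$ and $\phi \cdot \chi$.


Proposition 8.54. Let $\chi$ be a character of $G$. Then either $\chi \equiv 1$ or $\mathbf{E}[\chi] = 0$.

Proof. If $\chi \not\equiv 1$, pick some $y \in G$ such that $\chi(y) \neq 1$. Since $x + y$ is uniformly
distributed on $G$ when $x \sim G$,

\[ \mathop{\mathbf{E}}_{x \sim G}[\chi(x)] = \mathop{\mathbf{E}}_{x \sim G}[\chi(x + y)] = \mathop{\mathbf{E}}_{x \sim G}[\chi(x)\chi(y)] = \chi(y)\mathop{\mathbf{E}}_{x \sim G}[\chi(x)]. \]

Since $\chi(y) \neq 1$ it follows that $\mathbf{E}[\chi(x)]$ must be 0.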


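Proposition 8.55. The set of all characters of $G$ is orthonormal. (As a consequence,
$G$ has at most $\dim(L^2(G)) = |G|$ characters.)


Proof. First, if $\chi$ is a character, then $\langle \chi, \chi \rangle = \mathbf{E}[|\chi|^2] = 1$ because $|\chi| \equiv 1$.
Next, if $\phi$ is another character distinct from $\chi$ then $\langle \phi, \chi \rangle = \mathbf{E}[\phi \cdot \overline{\chi}]$. But
$\phi \cdot \overline{\chi}$ is a character by Fact 8.53, and $\phi \cdot \overline{\chi} = \phi/\chi \not\equiv 1$ because $\phi$ and $\chi$
are distinct; here we used $\overline{\chi} = 1/\chi$ because $|\chi| \equiv 1$. Thus $\langle \phi, \chi \rangle = 0$ by
Proposition 8.54.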


As we will see next, $G$ in fact has exactly $|G|$ characters. It thus follows from
Proposition 8.55 that the set of all characters (which includes the constant 1
function) constitutes a Fourier basis for $L^2(G)$.

To check that each finite abelian group $G$ has $|G|$ distinct characters, we
begin with the case of a cyclic group, $\mathbb{Z}_m$ for some $m$. In this case we know
that every character's range will be contained in the $m$th roots of unity.


Definition 8.56. Fix an integer $m \geq 2$ and write $\omega$ for the $m$th root of unity
$\exp(2\pi i/m)$. For $0 \leq j < m$, we define $\chi_j : \mathbb{Z}_m \to \mathbb{C}$ by $\chi_j(x) = \omega^{jx}$. It is
easy to see that these are distinct characters of $\mathbb{Z}_m$.


Thus the functions $\chi_0 \equiv 1, \chi_1, \dots, \chi_{m-1}$ form a Fourier basis for $L^2(\mathbb{Z}_m)$.
Furthermore, Proposition 8.13 tells us that we can get a Fourier basis for $L^2(\mathbb{Z}_m^n)$
by taking all products of these functions.


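Definition 8.57. Continuing Definition 8.56, let $n \in \mathbb{N}^+$. For $\alpha \in \mathbb{N}^n_{<m}$, we
define $\chi_\alpha : \mathbb{Z}_m^n \to \mathbb{C}$ by

\[ \chi_\alpha(x) = \prod_{j=1}^n \chi_{\alpha_j}(x_j). \]

These functions are easily seen to be (all of the) characters of the group $\mathbb{Z}_m^n$,
and they constitute a Fourier basis of $L^2(\mathbb{Z}_m^n)$.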
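
As a numerical sanity check (not a proof), the following Python sketch builds the product characters of Definition 8.57 for small $m$ and $n$ and tests that they are orthonormal under the inner product of Definition 8.51. The function names (`character`, `inner_product`, `is_orthonormal`) are illustrative choices, and the check uses floating-point arithmetic with a tolerance.

```python
import cmath
import itertools

def character(alpha, m):
    """Return chi_alpha on Z_m^n as a function of a tuple x (Definition 8.57)."""
    omega = cmath.exp(2j * cmath.pi / m)
    def chi(x):
        # prod_j omega^(alpha_j * x_j) = omega^(sum_j alpha_j * x_j)
        return omega ** (sum(a * xj for a, xj in zip(alpha, x)) % m)
    return chi

def inner_product(f, g, m, n):
    """<f, g> = E_x[f(x) * conj(g(x))] with x uniform on Z_m^n."""
    points = list(itertools.product(range(m), repeat=n))
    return sum(f(x) * g(x).conjugate() for x in points) / len(points)

def is_orthonormal(m, n, tol=1e-9):
    """Check that the m^n functions chi_alpha have Gram matrix equal to the identity."""
    indices = list(itertools.product(range(m), repeat=n))
    chars = [character(alpha, m) for alpha in indices]
    for i, f in enumerate(chars):
        for j, g in enumerate(chars):
            expected = 1 if i == j else 0
            if abs(inner_product(f, g, m, n) - expected) > tol:
                return False
    return True
```

For example, `is_orthonormal(3, 2)` checks all 81 pairs among the 9 characters of $\mathbb{Z}_3^2$.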


Most generally, by the Fundamental Theorem of Finitely Generated Abelian
Groups we know that any finite abelian $G$ is a direct product of cyclic groups
of prime-power order. In Exercise 8.35 you are asked to check that you get all
of the characters of $G$ (and hence a Fourier basis for $L^2(G)$) by taking all
products of the associated cyclic groups' characters. In the remainder of the
section we mostly stick to groups of the form $\mathbb{Z}_m^n$, for simplicity.

Returning to the characters $\chi_0, \dots, \chi_{m-1}$ from Definition 8.56, it is easy
to see (using $\omega^m = 1$) that they satisfy $\chi_j \cdot \chi_{j'} = \chi_{j + j' \pmod m}$ and also
$1/\chi_j = \overline{\chi_j} = \chi_{-j \pmod m}$. Thus the characters themselves form a group under
multiplication, isomorphic to $\mathbb{Z}_m$. As in Chapter 3.2, we index them using the
notation $\widehat{\mathbb{Z}_m}$. More generally, indexing the Fourier basis/characters of $L^2(\mathbb{Z}_m^n)$
by $\widehat{\mathbb{Z}_m^n}$ instead of multi-indices, we have:


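Fact 8.58. The characters $(\chi_\alpha)_{\alpha \in \widehat{\mathbb{Z}_m^n}}$ of $\mathbb{Z}_m^n$ form a group under multiplication:


• $\chi_\alpha \cdot \chi_\beta = \chi_{\alpha + \beta}$,
• $1/\chi_\alpha = \overline{\chi_\alpha} = \chi_{-\alpha}$.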


As mentioned, the salient feature of $L^2(G)$ distinguishing it from other
spaces $L^2(\Omega, \pi)$ is that there is a notion of addition on the domain. This means
that convolution plays a major role in its analysis. We generalize the definition
from the setting of $\mathbb{F}_2^n$:


Definition 8.59. Let $f, g \in L^2(G)$. Their convolution is the function $f * g \in L^2(G)$
defined by

\[ (f * g)(x) = \mathop{\mathbf{E}}_{y \sim G}[f(y)g(x - y)] = \mathop{\mathbf{E}}_{y \sim G}[f(x - y)g(y)]. \]


Exercise 8.36 asks you to check that convolution is associative and commutative,
and that the following generalization of Theorem 1.27 holds:


Theorem 8.60. Let $f, g \in L^2(G)$. Then $\widehat{f * g}(\alpha) = \widehat{f}(\alpha)\widehat{g}(\alpha)$.
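
To make the convolution theorem concrete, here is a small Python sketch for the cyclic group $\mathbb{Z}_m$. It computes Fourier coefficients against the characters of Definition 8.56 and the convolution of Definition 8.59, then measures how far $\widehat{f * g}(\alpha)$ is from $\widehat{f}(\alpha)\widehat{g}(\alpha)$. Functions are represented as length-$m$ lists of complex values; that representation and all function names are assumptions made for illustration.

```python
import cmath

def fourier_coefficient(f, alpha, m):
    """hat f(alpha) = <f, chi_alpha> = E_x[f(x) * conj(chi_alpha(x))] on Z_m."""
    omega = cmath.exp(2j * cmath.pi / m)
    return sum(f[x] * (omega ** ((alpha * x) % m)).conjugate() for x in range(m)) / m

def convolve(f, g, m):
    """(f * g)(x) = E_y[f(y) * g(x - y)] with y uniform on Z_m."""
    return [sum(f[y] * g[(x - y) % m] for y in range(m)) / m for x in range(m)]

def convolution_theorem_error(f, g, m):
    """Largest gap between hat(f*g)(alpha) and hat f(alpha) * hat g(alpha)."""
    h = convolve(f, g, m)
    return max(
        abs(fourier_coefficient(h, a, m)
            - fourier_coefficient(f, a, m) * fourier_coefficient(g, a, m))
        for a in range(m)
    )
```

On any pair of inputs the returned error should be zero up to floating-point rounding.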


We conclude this section by mentioning vector space domains. When doing
Fourier analysis over the group $\mathbb{Z}_m^n$, it is natural for subgroups to arise. Things
are simplest when the only subgroups of $\mathbb{Z}_m$ are the trivial ones, $\{0\}$ and $\mathbb{Z}_m$; in
this case, all subgroups will be isomorphic to $\mathbb{Z}_m^{n'}$ for some $n' \leq n$. Of course,
this simple situation occurs if and only if $m$ is equal to some prime $p$. In that
case, $\mathbb{Z}_p$ can be thought of as a field, $\mathbb{Z}_p^n$ as an $n$-dimensional vector space
over this field, and its subgroups as subspaces. We use the notation $\mathbb{F}_p^n$ in
this setting and write $\widehat{\mathbb{F}_p^n}$ to index the Fourier basis/characters; this generalizes
the notation introduced for $p = 2$ in Chapter 3.2. Indeed, all of the notions
from Chapters 3.2 and 3.3 regarding affine subspaces and restrictions thereto
generalize easily to $L^2(\mathbb{F}_p^n)$.
8.6. Highlight: Randomized Decision Tree Complexity


A decision tree $T$ for $f : \{-1,1\}^n \to \{-1,1\}$ can be thought of as a deterministic
algorithm which, given adaptive query access to the bits of an unknown
string $x \in \{-1,1\}^n$, outputs $f(x)$. For example, to describe a natural decision
tree for $f = \mathrm{Maj}_3$ in words: “Query $x_1$, then $x_2$. If they are equal, output their
value; otherwise, query and output $x_3$.” For a worst-case input (one where
$x_1 \neq x_2$) this algorithm has a cost of 3, meaning it makes 3 queries. The cost
of the worst-case input is the depth of the decision tree.

As is often the case with algorithms it can be advantageous to allow randomization.
For example, consider using the following randomized query algorithm
for $\mathrm{Maj}_3$: “Choose two distinct input coordinates at random and query them. If
they are equal, output their value; otherwise, query and output the third input
coordinate.” Now for every input there is at least a 1/3 chance that the algorithm
will finish after only 2 queries. Indeed, if we define the cost of an input $x$ to be the
expected number of queries the algorithm makes on it, it is easy to see that the
worst-case inputs for this algorithm have cost $(1/3) \cdot 2 + (2/3) \cdot 3 = 8/3 < 3$.
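
The 8/3 figure can be confirmed by brute force. The sketch below models the randomized algorithm as a uniformly random query order on the three coordinates (so the first two coordinates form a uniformly random pair) and computes the exact expected number of queries on every input. The function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations, product

def queries_used(order, x):
    """Query x in the given coordinate order; stop after two queries if they agree."""
    first, second, _ = order
    return 2 if x[first] == x[second] else 3

def randomized_cost(x):
    """Expected number of queries on x when the query order is uniformly random."""
    orders = list(permutations(range(3)))
    return Fraction(sum(queries_used(o, x) for o in orders), len(orders))

def worst_case_cost():
    """Maximum over all inputs of the expected number of queries."""
    return max(randomized_cost(x) for x in product([-1, 1], repeat=3))
```

Inputs with all three bits equal have cost 2; every other input has cost 8/3, which is therefore the worst case.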

Let’s formalize the notion of a randomized decision tree: 


Definition 8.61. Given $f : \{-1,1\}^n \to \mathbb{R}$, a (zero-error) randomized decision
tree $\mathcal{T}$ computing $f$ is formally defined to be a probability distribution over
(deterministic) decision trees that compute $f$. The cost of $\mathcal{T}$ on input
$x \in \{-1,1\}^n$ is defined to be the expected number of queries $T$ makes on $x$ when
$T \sim \mathcal{T}$. The cost of $\mathcal{T}$ itself is defined to be the maximum cost of any input.
Finally, the (zero-error) randomized decision tree complexity of $f$, denoted
$\mathrm{RDT}(f)$, is the minimum cost of a randomized decision tree computing $f$.


We can get further savings from randomization if we are willing to assume
that the input $x$ is chosen randomly. For example, if $x \sim \{-1,1\}^3$ is uniformly
random then any of the deterministic decision trees for $\mathrm{Maj}_3$ will make 2 queries
with probability 1/2 and 3 queries with probability 1/2, for an overall expected
5/2 < 8/3 < 3 queries.


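Definition 8.62. Let $\mathcal{T}$ be a randomized decision tree. We define

\[ \delta_i(\mathcal{T}) = \mathop{\mathbf{Pr}}_{\substack{x \sim \{-1,1\}^n \\ T \sim \mathcal{T}}}[T \text{ queries } x_i], \]

\[ \Delta(\mathcal{T}) = \sum_{i=1}^n \delta_i(\mathcal{T}) = \mathop{\mathbf{E}}_{\substack{x \sim \{-1,1\}^n \\ T \sim \mathcal{T}}}[\#\text{ of coordinates queried by } T \text{ on } x]. \tag{8.10} \]

Given $f : \{-1,1\}^n \to \mathbb{R}$, we define $\Delta(f)$ to be the minimum of $\Delta(\mathcal{T})$ over all
randomized decision trees $\mathcal{T}$ computing $f$.

We can also generalize these definitions for functions $f \in L^2(\Omega^n, \pi^{\otimes n})$. A
deterministic decision tree over domain $\Omega$ is the natural generalization in which
each internal query node has $|\Omega|$ outgoing edges, labeled by the elements of $\Omega$.
We write $\delta_i^{(\pi)}(\mathcal{T})$, $\Delta^{(\pi)}(\mathcal{T})$, $\Delta^{(\pi)}(f)$ for the generalizations to trees over $\Omega$; in
the case of $L^2(\{-1,1\}^n, \pi_p^{\otimes n})$ we use the superscript $(p)$ instead of $(\pi_p)$ for
brevity.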
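
For a single deterministic tree under the uniform distribution on $\{-1,1\}^n$, the query probabilities $\delta_i$ and their sum $\Delta$ can be computed by enumerating inputs. The sketch below does this for the natural tree for $\mathrm{Maj}_3$ described earlier. The nested-tuple tree encoding, the 0-based coordinate indices, and the names `MAJ3_TREE`, `queried_coordinates`, and `query_probabilities` are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product

# A leaf is an output value (int); an internal node is (i, {-1: subtree, 1: subtree}),
# meaning "query coordinate i and branch on its value". Coordinates are 0-based.
MAJ3_TREE = (0, {
    1: (1, {1: 1, -1: (2, {1: 1, -1: -1})}),
    -1: (1, {-1: -1, 1: (2, {1: 1, -1: -1})}),
})

def queried_coordinates(tree, x):
    """Return the set of coordinates the tree reads on input x."""
    queried = set()
    while not isinstance(tree, int):
        i, children = tree
        queried.add(i)
        tree = children[x[i]]
    return queried

def query_probabilities(tree, n):
    """delta_i(T) for each coordinate i, with x uniform on {-1, 1}^n."""
    inputs = list(product([-1, 1], repeat=n))
    counts = [0] * n
    for x in inputs:
        for i in queried_coordinates(tree, x):
            counts[i] += 1
    return [Fraction(c, len(inputs)) for c in counts]
```

Here `query_probabilities(MAJ3_TREE, 3)` gives $(1, 1, 1/2)$, whose sum $5/2$ is the expected number of queries on a uniformly random input.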


It follows immediately from the definitions that for any $f \in L^2(\Omega^n, \pi^{\otimes n})$,

\[ \Delta^{(\pi)}(f) \leq \mathrm{RDT}(f) \leq \mathrm{DT}(f). \]


Remark 8.63. In the definition of $\Delta^{(\pi)}(f)$ it is equivalent if we only allow
deterministic decision trees; this is because in (8.10) we can always choose the
“best” deterministic $T$ in the support of $\mathcal{T}$.


Example 8.64. It follows from our discussions that $\mathrm{RDT}(\mathrm{Maj}_3) \leq 8/3$ and
$\Delta(\mathrm{Maj}_3) \leq 5/2$; indeed, it's not hard to show that both of these bounds are
equalities. In Exercise 8.38 you are asked to generalize to the recursive majority
of 3 function on $n = 3^d$ inputs; it satisfies $\mathrm{DT}(\mathrm{Maj}_3^{\otimes d}) = 3^d = n$, but

\[ \mathrm{RDT}(\mathrm{Maj}_3^{\otimes d}) \leq (8/3)^d = n^{\log_3(8/3)} \approx n^{.89}, \]
\[ \Delta(\mathrm{Maj}_3^{\otimes d}) \leq (5/2)^d = n^{\log_3(5/2)} \approx n^{.83}. \]

Incidentally, these bounds are not asymptotically sharp; estimating
$\mathrm{RDT}(\mathrm{Maj}_3^{\otimes d})$ in particular is a well-studied open problem.


Example 8.65. In Exercise 8.39 you are asked to show that for the logical
OR function, $\Delta^{(p)}(\mathrm{OR}_n) = \frac{1 - (1-p)^n}{p}$, which is roughly 2 for $p = 1/2$ but is
asymptotic to $n/(2 \ln 2)$ at the critical probability $p_c$.
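
The two regimes in Example 8.65 are easy to see numerically. The sketch below assumes each input bit is True independently with probability $p$ and that the tree queries bits left to right, stopping at the first True; the function names are illustrative. It compares the direct sum with the closed form $(1-(1-p)^n)/p$ and evaluates the closed form at the critical probability, where $1 - (1-p_c)^n = 1/2$.

```python
import math

def or_tree_expected_queries(n, p):
    """Expected queries of the left-to-right tree for OR_n when each bit is True w.p. p."""
    # The k-th query is made exactly when the first k-1 bits were all False.
    return sum((1 - p) ** (k - 1) for k in range(1, n + 1))

def closed_form(n, p):
    """The closed form (1 - (1-p)^n) / p."""
    return (1 - (1 - p) ** n) / p

def critical_probability(n):
    """p_c solving Pr[OR_n = True] = 1 - (1-p)^n = 1/2."""
    return 1 - 2 ** (-1 / n)

def ratio_to_asymptote(n):
    """Closed form at p_c divided by n / (2 ln 2); should tend to 1 as n grows."""
    return closed_form(n, critical_probability(n)) / (n / (2 * math.log(2)))
```

For $p = 1/2$ the value is already within $10^{-9}$ of 2 once $n$ is around 30, while at $p_c$ it grows linearly in $n$.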


Example 8.64 illustrates a mildly surprising phenomenon: using randomness
it's possible to evaluate certain unbiased $n$-bit functions $f$ while reading
only a $1/n^{\Theta(1)}$ fraction of the input bits. This is even more interesting when $f$
is transitive-symmetric like $\mathrm{Maj}_3^{\otimes d}$. In that case it's not hard to show
(Exercise 8.37) that any randomized decision tree $\mathcal{T}$ computing $f$ can be converted
to one where $\Delta(\mathcal{T})$ remains the same but all $\delta_i(\mathcal{T})$ are equal to $\Delta(\mathcal{T})/n$. Then $f$
can be evaluated despite the fact that each input bit is only queried with
probability $1/n^{\Theta(1)}$.

In this section we explore the limits of this phenomenon. In particular, a
longstanding conjecture of Yao (Yao, 1977) says that this is not possible for
monotone graph properties:


Yao's Conjecture. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a nonconstant monotone
$v$-vertex graph property, where $n = \binom{v}{2}$. Then $\mathrm{RDT}(f) \geq \Omega(n)$.


Toward this conjecture we will present a lower bound due to O'Donnell,
Saks, Schramm, and Servedio (O'Donnell et al., 2005). (Two other incomparable
bounds are discussed in the notes for this chapter.) It has the advantages
that it works for the more general class of transitive-symmetric functions and
that it even lower-bounds $\Delta^{(p_c)}(f)$:


Theorem 8.66. Let $f : \{-1,1\}^n \to \{-1,1\}$ be a nonconstant monotone
transitive-symmetric function with critical probability $p_c$. Then

\[ \Delta^{(p_c)}(f) \geq (n/\sigma_c)^{2/3}, \]

where $\sigma_c = \sqrt{4p_c(1-p_c)}$.


Theorem 8.66 is essentially sharp in several interesting cases. Whenever the
critical probability $p_c$ is $\Theta(1/n)$ or $1 - \Theta(1/n)$ then $\sigma_c = \Theta(1/\sqrt{n})$ and
Theorem 8.66 gives the strongest possible bound, $\Delta^{(p_c)}(f) \geq \Omega(n)$. This occurs,
e.g., for the $\mathrm{OR}_n$ function (Example 8.65). Furthermore, Theorem 8.66 can be
tight up to a logarithmic factor when $p_c = 1/2$ as the following theorem of
Benjamini, Schramm, and Wilson shows:


Theorem 8.67. (Benjamini et al., 2005). There exists an infinite family of
monotone transitive-symmetric functions $f_n : \{-1,1\}^n \to \{-1,1\}$ with critical
probability $p_c = 1/2$ and $\Delta(f_n) \leq O(n^{2/3} \log n)$.


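Theorem 8.66 follows easily from two inequalities (O'Donnell and Servedio,
2006, 2007), (O'Donnell et al., 2005), which we now present:


OS Inequality. Let $f \in L^2(\{-1,1\}^n, \pi_p^{\otimes n})$. Then

\[ \sum_{i=1}^n \widehat{f}(i) \leq \|f\|_2 \cdot \sqrt{\Delta^{(p)}(f)}. \]

In particular, if $f$ has range $\{-1,1\}$ and is monotone, then
$\mathbf{I}[f] \leq \sigma\sqrt{\Delta^{(p)}(f)}$.


OSSS Inequality. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have range $\{-1,1\}$ and let $\mathcal{T}$ be any
randomized decision tree computing $f$. Then

\[ \mathbf{Var}[f] \leq \sum_{i=1}^n \delta_i^{(\pi)}(\mathcal{T}) \cdot \mathbf{Inf}_i[f]. \]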
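
As a small worked instance of the OSSS Inequality, take $f = \mathrm{Maj}_3$ under the uniform distribution with the natural deterministic tree (query the first two bits; read the third only if they differ). The Python sketch below computes both sides exactly by enumeration. The function names and the 0-based coordinate indexing are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

INPUTS = list(product([-1, 1], repeat=3))

def maj3(x):
    return 1 if sum(x) > 0 else -1

def queried(x):
    """Coordinates read by the tree: always 0 and 1; coordinate 2 only if they differ."""
    return {0, 1} if x[0] == x[1] else {0, 1, 2}

def variance(f):
    """Var[f] for a +-1 valued f under the uniform distribution (E[f^2] = 1)."""
    mean = Fraction(sum(f(x) for x in INPUTS), len(INPUTS))
    return 1 - mean ** 2

def influence(f, i):
    """Inf_i[f] = Pr_x[f(x) != f(x with coordinate i flipped)] for +-1 valued f."""
    def flip(x):
        return x[:i] + (-x[i],) + x[i + 1:]
    return Fraction(sum(f(x) != f(flip(x)) for x in INPUTS), len(INPUTS))

def delta(i):
    """Probability that the tree queries coordinate i on a uniform input."""
    return Fraction(sum(i in queried(x) for x in INPUTS), len(INPUTS))

def osss_sides():
    """Return (Var[f], sum_i delta_i * Inf_i[f]) for Maj_3 and the tree above."""
    lhs = variance(maj3)
    rhs = sum(delta(i) * influence(maj3, i) for i in range(3))
    return lhs, rhs
```

Here the left side is 1 and the right side is $1 \cdot \tfrac12 + 1 \cdot \tfrac12 + \tfrac12 \cdot \tfrac12 = \tfrac54$, so the inequality holds with some slack; the plain Poincaré bound would give $\mathbf{I}[f] = \tfrac32$ on the right.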
Remark 8.68. An interesting corollary of the OSSS Inequality is that

\[ \mathbf{MaxInf}[f] \geq \mathbf{Var}[f]/\Delta^{(\pi)}(f) \geq \mathbf{Var}[f]/\mathrm{DT}(f) \geq \mathbf{Var}[f]/\deg(f)^3, \]

the last inequality assuming $\Omega = \{-1,1\}$. See Exercise 8.44.


These two inequalities can be thought of as strengthenings of basic Fourier
inequalities which take into account the decision tree complexity of $f$. The
OS Inequality essentially generalizes the result that majority functions maximize
$\sum_{i=1}^n \widehat{f}(i)$; i.e., Theorem 2.33. The OSSS Inequality is a generalization
of the Poincaré Inequality, discounting the influences of coordinates that are
rarely read.

We will first derive the query complexity lower bound Theorem 8.66 from
the OS and OSSS Inequalities. We will then prove the latter two inequalities.


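Proof of Theorem 8.66. We consider $f$ to be an element of $L^2(\{-1,1\}^n, \pi_{p_c}^{\otimes n})$.
Let $\mathcal{T}$ be a randomized decision tree achieving $\Delta^{(p_c)}(f)$. In the OSSS Inequality,
we have $\mathbf{Var}[f] = 1$ since $p_c$ is the critical probability and $\mathbf{Inf}_i[f] = \mathbf{I}[f]/n$
for each $i \in [n]$ since $f$ is transitive-symmetric. Thus

\[ 1 = \mathbf{Var}[f] \leq \sum_{i=1}^n \delta_i^{(p_c)}(\mathcal{T}) \cdot \frac{\mathbf{I}[f]}{n} = \frac{1}{n} \cdot \Delta^{(p_c)}(f) \cdot \mathbf{I}[f] \leq \frac{1}{n} \cdot \Delta^{(p_c)}(f) \cdot \sigma_c\sqrt{\Delta^{(p_c)}(f)} = \frac{\sigma_c}{n} \cdot \Delta^{(p_c)}(f)^{3/2}, \]

where we used the OS Inequality. The theorem follows by rearranging.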


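Now we prove the OS and OSSS Inequalities, starting with the latter. We
will need a simple lemma that uses the decomposition $f = \mathrm{E}_j f + \mathrm{L}_j f$.


Lemma 8.69. Let $f, g \in L^2(\Omega^n, \pi^{\otimes n})$ and let $j \in [n]$. Given $\omega \in \Omega$, write $f_{|\omega}$
for the restriction of $f$ in which the $j$th coordinate is fixed to value $\omega$, and
similarly for $g$. Then

\[ \mathbf{Cov}[f, g] = \mathop{\mathbf{E}}_{\substack{\omega, \omega' \sim \pi \\ \text{independent}}}\bigl[\mathbf{Cov}[f_{|\omega}, g_{|\omega'}]\bigr] + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle. \]


Proof. Since the covariances and Laplacians are unchanged when constants
are added, we may assume without loss of generality that $\mathbf{E}[f] = \mathbf{E}[g] = 0$.
Then $\mathbf{Cov}[f, g] = \langle f, g \rangle$ and

\begin{align*}
\mathop{\mathbf{E}}_{\omega, \omega'}\bigl[\mathbf{Cov}[f_{|\omega}, g_{|\omega'}]\bigr]
  &= \mathop{\mathbf{E}}_{\omega, \omega'}\bigl[\langle f_{|\omega}, g_{|\omega'} \rangle - \mathbf{E}[f_{|\omega}]\,\mathbf{E}[g_{|\omega'}]\bigr] \\
  &= \mathop{\mathbf{E}}_{\omega, \omega'}\bigl[\langle f_{|\omega}, g_{|\omega'} \rangle\bigr] - \mathbf{E}[f]\,\mathbf{E}[g]
   = \mathop{\mathbf{E}}_{\omega, \omega'}\bigl[\langle f_{|\omega}, g_{|\omega'} \rangle\bigr] = \langle \mathrm{E}_j f, \mathrm{E}_j g \rangle.
\end{align*}

Thus the stated equality reduces to the basic (Exercise 8.8) identity

\[ \langle f, g \rangle = \langle \mathrm{E}_j f, \mathrm{E}_j g \rangle + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle. \]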
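Proof of the OSSS Inequality. More generally we show that if $g : \Omega^n \to \{-1,1\}$
is also an element of $L^2(\Omega^n, \pi^{\otimes n})$, then

\[ \mathbf{Cov}[f, g] \leq \sum_{i=1}^n \delta_i^{(\pi)}(\mathcal{T}) \cdot \mathbf{Inf}_i[g]. \tag{8.11} \]

The result then follows by taking $g = f$. We may also assume that $\mathcal{T} = T$ is
a single deterministic tree computing $f$; this is because (8.11) is linear in the
quantities $\delta_i^{(\pi)}(\mathcal{T})$.

We prove (8.11) by induction on the structure of $T$. If $T$ is depth-0, then $f$
must be a constant function; hence $\mathbf{Cov}[f, g] = 0$ and (8.11) is trivial. Otherwise,
let $j \in [n]$ be the coordinate queried at the root of $T$. For each $\omega \in \Omega$,
write $T_\omega$ for the subtree of $T$ given by the $\omega$-labeled child of the root. By
applying Lemma 8.69 and induction (noting that $T_\omega$ computes the restricted
function $f_{|\omega}$), we get

\begin{align*}
\mathbf{Cov}[f, g] &= \mathop{\mathbf{E}}_{\substack{\omega, \omega' \sim \pi \\ \text{independent}}}\bigl[\mathbf{Cov}[f_{|\omega}, g_{|\omega'}]\bigr] + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle \\
 &\leq \mathop{\mathbf{E}}_{\omega, \omega' \sim \pi}\Bigl[\sum_{i \neq j} \delta_i^{(\pi)}(T_\omega) \cdot \mathbf{Inf}_i[g_{|\omega'}]\Bigr] + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle \\
 &= \sum_{i \neq j} \delta_i^{(\pi)}(T) \cdot \mathbf{Inf}_i[g] + \langle f, \mathrm{L}_j g \rangle && \text{(in part since } \mathrm{E}_j \mathrm{L}_j g = 0\text{)} \\
 &\leq \sum_{i \neq j} \delta_i^{(\pi)}(T) \cdot \mathbf{Inf}_i[g] + \mathbf{E}[|\mathrm{L}_j g|] && \text{(since } |f| \leq 1\text{)} \\
 &= \sum_{i=1}^n \delta_i^{(\pi)}(T) \cdot \mathbf{Inf}_i[g],
\end{align*}

where the last step used $\delta_j^{(\pi)}(T) = 1$ and Proposition 8.24. This completes the
inductive proof of (8.11).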


Finally, we prove the OS Inequality. For this we require a definition. 


Definition 8.70. Let $(\Omega, \pi)$ be a finite probability space and $T$ a deterministic
decision tree over $\Omega$. The decision tree process associated to $T$ generates
a random string $x$ distributed according to $\pi^{\otimes n}$ (and some additional random
variables), as follows:


(1) Start at the root node of $T$; say it queries coordinate $j_1$. Choose $x_{j_1} \sim \pi$
and follow the outgoing edge labeled by the outcome.

(2) Suppose the node of $T$ which is reached queries coordinate $j_2$. Choose
$x_{j_2} \sim \pi$ and follow the outgoing edge labeled by the outcome.

(3) Repeat until a leaf node is reached. At this point, define
$J = \{j_1, j_2, j_3, \dots\} \subseteq [n]$ to be the set of coordinates queried.

(4) Draw the as-yet-unqueried coordinates, denoted $x_{\overline{J}}$, from $\pi^{\otimes \overline{J}}$.


Despite the fact that the coordinates $x_i$ are drawn in a random, dependent
order, it's not hard to see (Exercise 8.42) that the final string $x = (x_J, x_{\overline{J}})$ is
distributed according to the product distribution $\pi^{\otimes n}$.
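
The decision tree process is straightforward to simulate. The sketch below runs it for the natural $\mathrm{Maj}_3$ tree over $\Omega = \{-1,1\}$ and compares the empirical distribution of the output string with the product distribution. Several choices here are assumptions made for illustration: the nested-tuple tree encoding, 0-based coordinates, the convention $\pi_p(-1) = p$, and the function names. The comparison is a Monte Carlo check with a fixed seed, not a proof.

```python
import random
from collections import Counter
from itertools import product

# A leaf is an output value (int); an internal node is (j, {-1: subtree, 1: subtree}).
MAJ3_TREE = (0, {
    1: (1, {1: 1, -1: (2, {1: 1, -1: -1})}),
    -1: (1, {-1: -1, 1: (2, {1: 1, -1: -1})}),
})

def draw_bit(p, rng):
    """Sample from pi_p on {-1, 1}, taking pi_p(-1) = p (an assumed convention)."""
    return -1 if rng.random() < p else 1

def decision_tree_process(tree, n, p, rng):
    """Return (x, J): the generated string and the set J of queried coordinates."""
    x = [None] * n
    J = set()
    while not isinstance(tree, int):      # steps (1)-(3): walk from the root to a leaf
        j, children = tree
        x[j] = draw_bit(p, rng)
        J.add(j)
        tree = children[x[j]]
    for i in range(n):                    # step (4): draw the unqueried coordinates
        if x[i] is None:
            x[i] = draw_bit(p, rng)
    return tuple(x), J

def max_deviation(tree, n, p, samples, seed=0):
    """Largest gap between empirical frequencies and the product distribution."""
    rng = random.Random(seed)
    counts = Counter(decision_tree_process(tree, n, p, rng)[0] for _ in range(samples))
    worst = 0.0
    for x in product([-1, 1], repeat=n):
        k = x.count(-1)
        exact = p ** k * (1 - p) ** (n - k)
        worst = max(worst, abs(counts[x] / samples - exact))
    return worst
```

With a few tens of thousands of samples the deviation should be on the order of the sampling error, consistent with Exercise 8.42.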


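Proof of the OS Inequality. We will prove the claim
$\sum_{i=1}^n \widehat{f}(i) \leq \|f\|_2 \cdot \sqrt{\Delta^{(p)}(f)}$; the “in particular” statement follows
immediately from Proposition 8.45. Fix a deterministic decision tree $T$ achieving
$\Delta^{(p)}(f)$ (see Remark 8.63) and let $x = (x_J, x_{\overline{J}})$ be drawn from the associated decision
tree process. Using the notation $\phi$ from Definition 8.39 we have

\[ \sum_{i=1}^n \widehat{f}(i) = \mathop{\mathbf{E}}_{x}\Bigl[f(x) \sum_{i=1}^n \phi(x_i)\Bigr] = \mathop{\mathbf{E}}_{J, x_J}\,\mathop{\mathbf{E}}_{x_{\overline{J}}}\Bigl[f(x_J) \sum_{i=1}^n \phi(x_i)\Bigr]. \]

Here we abused notation slightly by writing $f(x_J)$; in the decision tree process,
$f$'s value is determined once $x_J$ is. Since $\mathbf{E}[\phi(x_i)] = 0$ for each $i \notin J$ we may
continue:

\begin{align*}
\mathop{\mathbf{E}}_{J, x_J}\Bigl[f(x_J) \sum_{i \in J} \phi(x_i)\Bigr]
 &= \mathop{\mathbf{E}}_{J, x_J}\Bigl[f(x_J) \sum_{i=1}^n \mathbf{1}_{\{i \in J\}}\phi(x_i)\Bigr] \\
 &\leq \sqrt{\mathop{\mathbf{E}}_{J, x_J}[f(x_J)^2]} \cdot \sqrt{\mathop{\mathbf{E}}_{J, x_J}\Bigl[\Bigl(\sum_{i=1}^n \mathbf{1}_{\{i \in J\}}\phi(x_i)\Bigr)^2\Bigr]}
\end{align*}

by Cauchy–Schwarz. Now $\sqrt{\mathbf{E}_{J, x_J}[f(x_J)^2]}$ is simply $\|f\|_2$ since $T$ computes $f$.
To complete the proof it suffices to show that

\[ \mathop{\mathbf{E}}_{J, x_J}\Bigl[\Bigl(\sum_{i=1}^n \mathbf{1}_{\{i \in J\}}\phi(x_i)\Bigr)^2\Bigr] = \Delta^{(p)}(f). \]

To see this, expand the square:

\[ \mathop{\mathbf{E}}_{J, x_J}\Bigl[\Bigl(\sum_{i=1}^n \mathbf{1}_{\{i \in J\}}\phi(x_i)\Bigr)^2\Bigr] = \sum_{i=1}^n \mathop{\mathbf{E}}_{J, x_J}\bigl[\mathbf{1}_{\{i \in J\}}\phi(x_i)^2\bigr] + \sum_{i \neq i'} \mathop{\mathbf{E}}_{J, x_J}\bigl[\mathbf{1}_{\{i, i' \in J\}}\phi(x_i)\phi(x_{i'})\bigr]. \]

Conditioned on $i \in J$ the quantity $\mathbf{E}[\phi(x_i)^2]$ is simply 1. Thus

\[ \sum_{i=1}^n \mathop{\mathbf{E}}_{J, x_J}\bigl[\mathbf{1}_{\{i \in J\}}\phi(x_i)^2\bigr] = \sum_{i=1}^n \mathbf{Pr}[i \in J] = \Delta^{(p)}(f). \]

It remains to show that $\mathbf{E}_{J, x_J}[\mathbf{1}_{\{i, i' \in J\}}\phi(x_i)\phi(x_{i'})] = 0$ whenever $i \neq i'$.
Suppose we condition on the event that $i, i' \in J$ and we further condition on $i$ being
queried before $i'$ is queried. Certainly this may affect the conditional distribution
of $x_i$, but the conditional distribution of $x_{i'}$ remains $\pi_p$; hence $\mathbf{E}[\phi(x_{i'})] = 0$
under this conditioning. Of course the same argument holds when we condition
on $i'$ being queried before $i$. It follows that $\mathbf{E}_{J, x_J}[\mathbf{1}_{\{i, i' \in J\}}\phi(x_i)\phi(x_{i'})]$ is
indeed 0, completing the proof.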
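
To see the OS Inequality in action, the sketch below evaluates both sides for $f = \mathrm{Maj}_3$ under $\pi_p^{\otimes 3}$, using the natural tree (query two bits, read the third only if they differ). Since $\Delta^{(p)}(f)$ is at most the expected number of queries of this particular tree, the right-hand side computed here is an upper bound on the one in the theorem, so this checks a slightly weaker statement. The convention $\pi_p(-1) = p$, the basis function $\phi(b) = (b - \mu)/\sigma$ with $\mu = 1 - 2p$ and $\sigma = 2\sqrt{p(1-p)}$, and all function names are assumptions made for illustration.

```python
import math
from itertools import product

def maj3(x):
    return 1 if sum(x) > 0 else -1

def weight(x, p):
    """Probability of x under the product distribution, taking pi_p(-1) = p."""
    k = x.count(-1)
    return p ** k * (1 - p) ** (len(x) - k)

def degree_one_sum(f, p, n=3):
    """sum_i hat f(i) = E[f(x) * sum_i phi(x_i)] with phi(b) = (b - mu) / sigma."""
    mu = 1 - 2 * p
    sigma = 2 * math.sqrt(p * (1 - p))
    return sum(
        weight(x, p) * f(x) * sum((b - mu) / sigma for b in x)
        for x in product([-1, 1], repeat=n)
    )

def expected_queries(p):
    """Expected queries of the tree: 2, plus 1 when the first two bits differ."""
    return 2 + 2 * p * (1 - p)

def os_sides(p):
    """Return (sum_i hat f(i), ||f||_2 * sqrt(expected queries)); here ||f||_2 = 1."""
    return degree_one_sum(maj3, p), math.sqrt(expected_queries(p))
```

At $p = 1/2$ this gives $3/2$ on the left and $\sqrt{5/2} \approx 1.58$ on the right, so the bound is nearly tight for this small example.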


8.7. Exercises and Notes 


8.1 Explain how to generalize the definitions and results in Sections 8.1 and
8.2 to general finite product spaces $L^2(\Omega_1 \times \cdots \times \Omega_n, \pi_1 \otimes \cdots \otimes \pi_n)$.

8.2 Verify that Definition 8.1 indeed defines a real inner product space. (Where
is the fact that $\pi$ has full support used?)

8.3 Verify the formula for $\widehat{f}(\alpha)$ in Definition 8.14.

8.4 Verify that $\phi_0, \phi_1, \phi_2$ from Example 8.10 indeed constitute a Fourier basis
for $\Omega = \{a, b, c\}$ with the uniform distribution.

8.5 Verify the Fourier expansion in Example 8.15.

8.6 Complete the proof of Proposition 8.16.

8.7 Prove that the expectation over $I$ operator, $\mathrm{E}_I$, is a linear operator
on $L^2(\Omega^n, \pi^{\otimes n})$ (i.e., $\mathrm{E}_I(f + g) = \mathrm{E}_I f + \mathrm{E}_I g$), a projection (i.e.,
$\mathrm{E}_I \circ \mathrm{E}_I = \mathrm{E}_I$), and self-adjoint (i.e., $\langle f, \mathrm{E}_I g \rangle = \langle \mathrm{E}_I f, g \rangle$). Deduce that
$\mathrm{T}_\rho$ is also self-adjoint.

8.8 Show for any $f, g \in L^2(\Omega^n, \pi^{\otimes n})$ and $j \in [n]$ that $f = \mathrm{E}_j f + \mathrm{L}_j f$ and
that $\langle f, g \rangle = \langle \mathrm{E}_j f, \mathrm{E}_j g \rangle + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle$.

8.9 Prove Proposition 8.24. (Hint: Exercise 1.17.) 

8.10 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have range $\{-1,1\}$. Proposition 8.24 tells us that
$\|\mathrm{L}_i f\|_1 = \|\mathrm{L}_i f\|_2^2 = \mathbf{Inf}_i[f]$.

(a) Show that $\|\mathrm{L}_i f\|_p^p \leq 2^p\,\mathbf{Inf}_i[f]$ for any $p \geq 1$.

(b) In case $1 \leq p \leq 2$, show that in fact $\|\mathrm{L}_i f\|_p^p \leq \mathbf{Inf}_i[f]$. (Hint: Use
the general form of Hölder's inequality to bound $\|\mathrm{L}_i f\|_p$ in terms of
$\|\mathrm{L}_i f\|_1$ and $\|\mathrm{L}_i f\|_2$.)

8.11 Generalize all of Exercise 2.35 to the setting of $L^2(\Omega^n, \pi^{\otimes n})$. Caution:
the two statements referring to $\rho \in [-1, 1]$ should refer only to $\rho \in [0, 1]$
in this more general setting.

8.12 Assume $|\Omega| = m$ and let $\pi$ denote the uniform distribution on $\Omega$.

(a) For $x \in \Omega^n$ and $y \sim N_\rho(x)$, write a formula for $\mathbf{Pr}[y_i = \omega]$ in terms
of $\rho$ (there are two cases depending on whether or not $x_i = \omega$).

(b) Verify that your formula defines a valid probability distribution on $\Omega$
even when $-\frac{1}{m-1} \leq \rho < 0$. We may therefore extend the definition
of $N_\rho$ to this case. (Cf. the second half of Definition 2.40.)

(c) Verify that for $x \sim \pi^{\otimes n}$ and $y \sim N_\rho(x)$, the distribution of $(x, y)$ is
symmetric in $x$ and $y$.

(d) Show that when $y \sim N_{-\frac{1}{m-1}}(x)$, each $y_i$ is uniformly distributed on
$\Omega \setminus \{x_i\}$.

(e) Verify that the formula for $\mathrm{T}_\rho$ from Proposition 8.28 continues to hold
for $-\frac{1}{m-1} \leq \rho < 0$. (Hint: Use the fact that it holds for $\rho \in [0, 1]$ and
that the formula in part (a) is a polynomial in $\rho$.)


8.13 Show that Definition 8.30 extends by continuity to 


MOL = Y flay”. 
Ez 


for — 


Extend also Proposition 8.31 to the case of ô = 1. 

8.14 Prove explicitly that condition 5 holds in Theorem 8.35. 

8.15 Prove that condition 6 must hold in Theorem 8.35 directly from the 
uniqueness statement (i.e., without appealing to the explicit construction). 

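8.16 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Prove directly from the defining Theorem 8.35 that
$(f^{=S})^{\subseteq T}$ is equal to $f^{=S}$ if $S \subseteq T$ and is equal to 0 otherwise.

8.17 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ and let $x \sim \pi^{\otimes n}$. In this exercise you should think
about how the (conditional) expectation of $f$ changes as the random
variables $x_1, \dots, x_n$ are revealed one at a time.

(a) Recalling that $f^{\subseteq [t]}(x)$ depends only on $x_1, \dots, x_t$, show that the
sequence of random variables $(f^{\subseteq [t]}(x))_{t = 0 \dots n}$ is a martingale (where
$f^{\subseteq [0]}$ denotes $f^{\subseteq \emptyset}$); i.e.,

\[ \mathbf{E}\bigl[f^{\subseteq [t]}(x) \mid f^{\subseteq [0]}(x), \dots, f^{\subseteq [t-1]}(x)\bigr] = f^{\subseteq [t-1]}(x) \quad \forall t \in [n]. \]

(This is the Doob martingale for $f$.)

(b) For each $t \in [n]$ define

\[ \mathrm{d}_t f = f^{\subseteq [t]} - f^{\subseteq [t-1]} = \sum_{\substack{S \subseteq [n] \\ \max(S) = t}} f^{=S}. \]

Show that $\mathbf{E}[\mathrm{d}_t f(x) \mid f^{\subseteq [0]}(x), \dots, f^{\subseteq [t-1]}(x)] = 0$. (Here
$(\mathrm{d}_t f)_{t = 1 \dots n}$ is the martingale difference sequence for $f$.)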
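8.18 For $f, g \in L^2(\Omega^n, \pi^{\otimes n})$, prove the following directly from Theorem 8.35:

\begin{align*}
\langle f, g \rangle &= \sum_{S \subseteq [n]} \langle f^{=S}, g^{=S} \rangle \\
\mathbf{Inf}_i[f] &= \sum_{S \ni i} \|f^{=S}\|_2^2 \\
\mathbf{I}[f] &= \sum_{k=0}^n k \cdot \mathbf{W}^k[f] \\
\mathrm{T}_\rho f &= \sum_{S \subseteq [n]} \rho^{|S|} f^{=S} \\
\mathbf{Stab}_\rho[f] &= \sum_{k=0}^n \rho^k \cdot \mathbf{W}^k[f].
\end{align*}

8.19 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ and let $S \subseteq [n]$. Show that $\|f^{=S}\|_\infty \leq 2^{|S|}\|f\|_\infty$.

8.20 Explicitly verify that Proposition 8.36 holds for the function in
Examples 8.15 and 8.37.

8.21 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ and let $i \in S \subseteq [n]$. Suppose we take $f^{=S}$ and
restrict its $i$th coordinate to have value $\omega_i$, forming the subfunction
$g = (f^{=S})_{|\omega_i}$. Show that $g = g^{=S \setminus \{i\}}$. In particular, $\mathbf{E}[g] = 0$ assuming $|S| \geq 2$.

8.22 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be a symmetric function. Show that if $1 \leq |S| \leq |T| \leq n$,
then $\frac{1}{|S|}\mathbf{Var}[f^{\subseteq S}] \leq \frac{1}{|T|}\mathbf{Var}[f^{\subseteq T}]$.

8.23 Prove the sharp threshold statement about the majority function made in
Example 8.49. (Hint: Chernoff bound.) In the social choice literature, this
fact is known as the Condorcet Jury Theorem.

8.24 Let $p_1, \dots, p_n \in (0, 1)$ and let $\pi = \pi_{p_1} \otimes \cdots \otimes \pi_{p_n}$ be the associated product
distribution on $\{-1,1\}^n$. Write $\mu_i = 1 - 2p_i$ and $\sigma_i = 2\sqrt{p_i}\sqrt{1 - p_i}$.
Generalize Proposition 8.45 to the setting of $L^2(\{-1,1\}^n, \pi)$.

8.25 Let $f : \{-1,1\}^n \to \mathbb{R}$ and consider the general product distribution setting
of Exercise 8.24.

(a) For $S = \{i_1, \dots, i_k\} \subseteq [n]$, write $\mathrm{D}_{\phi_S}$ for $\mathrm{D}_{\phi_{i_1}} \circ \cdots \circ \mathrm{D}_{\phi_{i_k}}$ and
similarly $\mathrm{D}_{x_S}$. Show that $\mathrm{D}_{\phi_S} = \prod_{i \in S} \sigma_i \cdot \mathrm{D}_{x_S}$.

(b) Writing $f^{(\pi)}$ for the function $f$ viewed as an element of
$L^2(\{-1,1\}^n, \pi)$, show that

\[ \widehat{f^{(\pi)}}(S) = \prod_{i \in S} \sigma_i \cdot \mathrm{D}_{x_S} f(\mu_1, \dots, \mu_n). \]

(c) Show that $|\widehat{f^{(\pi)}}(S)| \leq \prod_{i \in S} \sigma_i \cdot \|f\|_\infty$.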


8.26 (a) Generalize Exercise 2.10 by showing that for $f \in L^2(\{-1,1\}^n, \pi_p^{\otimes n})$
with range $\{-1,1\}$,

Pr. [i is b-pivotal for f on x] = x,(b)Inf;[ f] 


for $i \in [n]$ and $b \in \{-1,1\}$.

(b) Generalize Proposition 4.7 by showing that if $f : \{-1,1\}^n \to \{-1,1\}$
has $\mathrm{DNF}_{\mathrm{width}}(f) \leq w$, then $\mathbf{I}[f^{(p)}] \leq 4qw \leq 4w$, and if $f$ has
$\mathrm{CNF}_{\mathrm{width}}(f) \leq w$, then $\mathbf{I}[f^{(p)}] \leq 4pw \leq 4w$.

8.27 Fix any $\alpha \in (0, 1)$. Let $f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$ be a nonconstant
monotone function. Show that there exists $p \in (0, 1)$ such that
$\mathbf{Pr}_{x \sim \pi_p^{\otimes n}}[f(x) = \text{True}] = \alpha$. (Hint: Intermediate Value Theorem.)

8.28 Fix a small constant $0 < \epsilon < 1/2$. Let $f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$
be a nonconstant monotone function. Let $p_0$ (respectively, $p_c$, $p_1$) be
the unique value of $p \in (0, 1)$ such that $\mathbf{Pr}_{x \sim \pi_p^{\otimes n}}[f(x) = \text{True}] = \epsilon$
(respectively, $1/2$, $1 - \epsilon$). (This is a valid definition by Exercise 8.27.) Define
also $\sigma_c^2 = 4p_c(1 - p_c)$. The threshold interval for $f$ is defined to be
$[p_0, p_1]$, and $\delta = p_1 - p_0$ is the threshold width. Now let $(f_n)_{n \in \mathbb{N}}$ be a
sequence of nonconstant monotone Boolean functions (usually “naturally
related”, with $f_n$'s input length an increasing function of $n$). Define the
sequences $p_0(n)$, $p_c(n)$, $p_1(n)$, $\sigma_c^2(n)$, $\delta(n)$. We say that the family $(f_n)$
has a sharp threshold if $\delta(n)/\sigma_c^2(n) \to 0$ as $n \to \infty$; otherwise, we say
it has a coarse threshold. (Note: If $p_c(n) \leq 1/2$ for all $n$, this is the same
as saying that $\delta(n)/p_c(n) \to 0$.) Show that if $(f_n)$ has a coarse threshold,
then there exists $C < \infty$, an infinite sequence $n_1 < n_2 < n_3 < \cdots$, and
a sequence $(p(n_i))_{i \in \mathbb{N}}$ such that:

• $\epsilon < \mathbf{Pr}_{x \sim \pi_{p(n_i)}^{\otimes n}}[f_{n_i}(x) = \text{True}] < 1 - \epsilon$ for all $i$;

• $\mathbf{I}[f_{n_i}^{(p(n_i))}] \leq C$ for all $i$.

(Hint: Margulis–Russo and the Mean Value Theorem.)

8.29 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a nonconstant monotone function and
let $F : [0, 1] \to [0, 1]$ be the (strictly increasing) function defined by
$F(p) = \mathbf{Pr}_{x \sim \pi_p^{\otimes n}}[f(x) = -1]$. Let $p_c$ be the critical probability such that
$F(p_c) = 1/2$. Assume that $p_c \leq 1/2$. (This is without loss of generality
since we can replace $f$ by $f^\dagger$. We often think of $p_c \ll 1/2$.) The goal of
this exercise is to show a weak kind of threshold result: roughly speaking,
$F(p) = o(1)$ when $p = o(p_c)$ and $F(p) = 1 - o(1)$ when $p = \omega(p_c)$.

(a) Using the Margulis–Russo Formula and the Poincaré Inequality show
that for all $0 < p < 1$,

\[ F'(p) \geq \frac{F(p)(1 - F(p))}{p(1 - p)}. \]

(b) Show that for all $p \leq p_c$ we have $F'(p) \geq \frac{F(p)}{2p}$ and hence
$\frac{d}{dp} \ln F(p) \geq \frac{1}{2p}$.

(c) Deduce that for any $0 \leq p_0 \leq p_c$ we have $F(p_0) \leq \frac{1}{2}\sqrt{p_0/p_c}$; i.e.,
$F(p_0) \leq \epsilon$ if $p_0 \leq (2\epsilon)^2 p_c$.

(d) Show that the factor $(2\epsilon)^2$ can be improved to $\Theta(t)\epsilon^{1+t}$ for any small
constant $t > 0$. (Hint: The quadratic dependence on $\epsilon$ arose because
we used $1 - F(p) \geq 1/2$ for $p \leq p_c$; but from part (c) we have the
improved bound $1 - F(p) \geq 1 - t$ once $p \leq (2t)^2 p_c$.)

(e) In the other direction, show that so long as $p_1 = \frac{p_c}{(2\epsilon)^2} \leq 1/2$,
we have $F(p_1) \geq 1 - \epsilon$. (Hint: Work with $\ln(1 - F(p))$.) In case
$p_1 \leq 1/2$ does not hold, show that we at least have
$F(1/2) \geq 1 - \sqrt{p_c/2}$.

(f) The bounds in part (e) are not very interesting when $p_c$ is close to 1/2.
Show that we also have $F(1 - \delta) \geq 1 - \sqrt{\delta/2}$ (even when $p_c = 1/2$).


8.30 Consider the sequence of functions $f_n : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$
defined for all odd $n \geq 3$ as follows:
$f_n(x_1, \dots, x_n) = \mathrm{Maj}_3(x_1, x_2, \mathrm{Maj}_{n-2}(x_3, \dots, x_n))$.

(a) Show that $f_n$ is monotone and has critical probability $p_c = 1/2$.

(b) Sketch a plot of $\mathbf{Pr}_{x \sim \pi_p^{\otimes n}}[f_n(x) = \text{True}]$ versus $p$ (assuming $n$ very large).

(c) Show that $\mathbf{I}[f_n] = \Theta(\sqrt{n})$.

(d) Show that the sequence $f_n$ has a coarse threshold as defined in
Exercise 8.28 (assuming $\epsilon < 1/4$).

8.31 (a) Consider the following probability distributions on strings $x \in \mathbb{F}_2^n$:

(1) First choose $k \sim \{0, 1, 2, \dots, n\}$ uniformly. Then choose $x$ uniformly
from the set of all strings of Hamming weight $k$.

(2) First choose a uniformly random “path $\pi$ from $(0, 0, \dots, 0)$ up
to $(1, 1, \dots, 1)$”; i.e., let $\pi$ be a uniformly random permutation
from $S_n$ and let $\pi^{\leq i} \in \mathbb{F}_2^n$ denote the string whose $j$th coordinate
is 1 if and only if $\pi(j) \leq i$. Then choose $k \sim \{0, 1, 2, \dots, n\}$
uniformly and let $x$ be the “$k$th string on the path”, namely $\pi^{\leq k}$.

(3) First choose $p \sim [0, 1]$. Then choose $x \sim \pi_p^{\otimes n}$.

Show that these are in fact the same distribution. (Hint: Imagine
choosing $n + 1$ indistinguishable points uniformly from $[0, 1]$ and
then randomly assigning them the labels “$p$”, $1, 2, \dots, n$.)

(b) We denote by $\nu^n$ the distribution on $\mathbb{F}_2^n$ from part (a); more generally,
we use the notation $\nu^N$ for the distribution on $\mathbb{F}_2^N$ where $N$ is an
abstract set of cardinality $n$. Given a nonempty $J \subseteq [n]$, show that
if $x \sim \nu^n$ and $x_J \in \mathbb{F}_2^J$ denotes the restriction of $x$ to coordinates $J$,
then $x_J$ has the distribution $\nu^J$.
(c) Let $f : \mathbb{F}_2^n \to \mathbb{R}$ and fix $i \in [n]$. The $i$th Shapley value of $f$ is defined
to be


Shap, [| f] = Efe) = FPO], 


Show that )~"_, Shap;[f] = f(1, 1,..., 1) — f(0,0,..., 0). 
(d) Suppose f : F5 — {0,1} is monotone. Show that Shap,[/f] = 
4 fi Inf;[ f | dp. 
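8.32 Explain how to generalize the definitions and results in Sections 8.1, 8.2
to the case of the complex inner product space $L^2(\Omega^n, \pi^{\otimes n})$. In particular,
verify the following formulas from Proposition 8.16:

\begin{align*}
\mathbf{E}[f] &= \widehat{f}(0) \\
\mathbf{E}[|f|^2] &= \langle f, f \rangle = \sum_{\alpha \in \mathbb{N}^n_{<m}} \langle \widehat{f}(\alpha), \widehat{f}(\alpha) \rangle = \sum_{\alpha \in \mathbb{N}^n_{<m}} |\widehat{f}(\alpha)|^2 \\
\mathbf{Var}[f] &= \langle f - \mathbf{E}[f], f - \mathbf{E}[f] \rangle = \sum_{\alpha \neq 0} |\widehat{f}(\alpha)|^2 \\
\langle f, g \rangle &= \sum_{\alpha \in \mathbb{N}^n_{<m}} \langle \widehat{f}(\alpha), \widehat{g}(\alpha) \rangle = \sum_{\alpha \in \mathbb{N}^n_{<m}} \widehat{f}(\alpha)\overline{\widehat{g}(\alpha)} \\
\mathbf{Cov}[f, g] &= \langle f - \mathbf{E}[f], g - \mathbf{E}[g] \rangle = \sum_{\alpha \neq 0} \widehat{f}(\alpha)\overline{\widehat{g}(\alpha)}.
\end{align*}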


8.33 (a) As in Exercise 2.58, explain how to generalize the definitions and
results in Sections 8.1, 8.2 to the case of functions $f : \Omega^n \to V$,
where $V$ is a real inner product space with inner product $\langle \cdot, \cdot \rangle_V$. Here
the Fourier coefficients $\widehat{f}(\alpha)$ will be elements of $V$, and $\langle f, g \rangle$ is
defined to be $\mathbf{E}_{x \sim \pi^{\otimes n}}[\langle f(x), g(x) \rangle_V]$. In particular, verify the formulas
from Proposition 8.16, including the Plancherel Theorem
$\langle f, g \rangle = \sum_\alpha \langle \widehat{f}(\alpha), \widehat{g}(\alpha) \rangle_V$.

(b) For $\Sigma$ a finite set we write $\triangle_\Sigma$ for the set of all probability
distributions over $\Sigma$ (cf. Exercise 7.22). Writing $|\Sigma| = m$, we also
identify $\triangle_\Sigma$ with the standard convex simplex in $\mathbb{R}^m$, namely
$\{\mu \in \mathbb{R}^m : \mu_1 + \cdots + \mu_m = 1,\ \mu_i \geq 0\ \forall i\}$ (where we assume some
fixed ordering of $\Sigma$). Finally, we identify the $m$ elements of $\Sigma$ with
the constant distributions in $\triangle_\Sigma$; equivalently, the vertices of the
form $(0, \dots, 0, 1, 0, \dots, 0)$. Given a function $f : \Omega^n \to \Sigma$, often the
most useful way to treat it analytically is to interpret it as a function
$f : \Omega^n \to \triangle_\Sigma \subset \mathbb{R}^m$ and then use the setting described in part (a),
with $V = \mathbb{R}^m$. Using this idea, show that if $f : \Omega^n \to \Sigma$ and $\pi$ is a
distribution on $\Omega$, then

\[ \mathbf{Stab}_\rho[f] = \mathop{\mathbf{Pr}}_{\substack{x \sim \pi^{\otimes n} \\ y \sim N_\rho(x)}}[f(x) = f(y)]. \]

(Here in $\mathbf{Stab}_\rho[f]$ we are interpreting $f$'s range as $\triangle_\Sigma \subset \mathbb{R}^m$,
whereas in the expression $f(x) = f(y)$ we are treating $f$'s range
as the abstract set $\Sigma$.)

8.34 We say a function $f \in L^2(\Omega^n, \pi^{\otimes n})$ is a linear threshold function if it is
expressible as $f(x) = \mathrm{sgn}(\ell(x))$, where $\ell : \Omega^n \to \mathbb{R}$ has degree at most 1
(in the sense of Definition 8.32).

(a) Given $\omega^{(+1)}, \omega^{(-1)} \in \Omega^n$ and $x \in \{-1,1\}^n$, we introduce the notation
$\omega^{(x)}$ for the string $(\omega_1^{(x_1)}, \dots, \omega_n^{(x_n)}) \in \Omega^n$. Show that if
$\omega^{(+1)}, \omega^{(-1)} \sim \pi^{\otimes n}$ are drawn independently and
$(x, y) \sim \{-1,1\}^n \times \{-1,1\}^n$ is a $\rho$-correlated pair of binary strings,
then $(\omega^{(x)}, \omega^{(y)})$ is a $\rho$-correlated pair under $\pi^{\otimes n}$.

(b) Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be a linear threshold function. Given a
pair $\omega^{(+1)}, \omega^{(-1)} \in \Omega^n$, define $g_{\omega^{(+1)}, \omega^{(-1)}} : \{-1,1\}^n \to \{-1,1\}$ by
$g_{\omega^{(+1)}, \omega^{(-1)}}(x) = f(\omega^{(x)})$. Show that $g_{\omega^{(+1)}, \omega^{(-1)}}$ is a linear threshold
function in the “usual” sense.

(c) Prove that Peres's Theorem (from Chapter 5.5) applies to linear
threshold functions in $L^2(\Omega^n, \pi^{\otimes n})$, with the same bounds.

8.35 Let $G$ be a finite abelian group. We know by the Fundamental Theorem of
Finitely Generated Abelian Groups that $G \cong \mathbb{Z}_{m_1} \times \cdots \times \mathbb{Z}_{m_n}$ where each
$m_j$ is a prime power.

(a) Given $\alpha \in G$, define $\chi_\alpha : G \to \mathbb{C}$ by

\[ \chi_\alpha(x) = \prod_{j=1}^n \exp(2\pi i \alpha_j x_j/m_j). \]

Show $\chi_\alpha$ is a character of $G$ and that the $\chi_\alpha$'s are distinct functions
for distinct $\alpha$'s. Deduce that the set of all $\chi_\alpha$'s forms a Fourier basis
for $L^2(G)$.

(b) Show that this set of characters forms a group under multiplication
and that this group is isomorphic to $G$; i.e., generalize Fact 8.58. This
is called the dual group of $G$ and it is written $\widehat{G}$. We also identify the
characters in $\widehat{G}$ with their indices $\alpha$.

8.36 Verify that the convolution operation on $L^2(G)$ is associative and commutative,
and that it satisfies $\widehat{f * g}(\alpha) = \widehat{f}(\alpha)\widehat{g}(\alpha)$ for all $\alpha \in \widehat{G}$. (See
Exercise 8.35 for the definition of $\widehat{G}$.)

(a) Let fe L?(Q", m8”) be any transitive-symmetric function and let 7 
be a randomized decision tree computing f. Show that there exists a 
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randomized decision tree J computing f with A(T) = A™ (J) 
and such that oo? (J) is the same for all i € [n]. (Hint: Randomize 
over the automorphism group Aut( f) and use Exercise 2.47.) 
Given a randomized decision tree J; let 6™(.F) = maxjen{ S(T}. 
Given f € L?({—1, 1}", 7%"), define 6 (f) to be the minimum 
value of am (JF) over all J which compute f; this is called the 
revealment of f. Show that if f is transitive-symmetric, then 
SOF) = IAO f). 
8.38 (a) Show that DT(Maj$“) = 34, RDT(Maj?“) < (8/3)?, and 
A(Maj$“) < (5/2)¢. 
(b) Show that RDT(Maj?”) < (8/3). How small can you make your 
upper bound? 


(b 


wm 


8.39 (a) Show that for every deterministic decision tree T computing the 
logical OR function on n bits, 


AY(T) = p-1+(1— p)p-2+(— py p-3+-- 
1-0-7 
p 


elep Ap Ga De Pan 


Deduce A‘”)(OR,,) = = 

(b) Show that A%°(OR,,) ~ n/(21n2) as n —> œ, where p, denotes the 
critical probability for OR,,. 

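8.40 Let $\mathrm{NAND} : \{\mathsf{True}, \mathsf{False}\}^2 \to \{\mathsf{True}, \mathsf{False}\}$ be the function that outputs
True unless both its inputs are True.

(a) Show that for $d$ even, $\mathrm{NAND}^{\otimes d} = \mathrm{Tribes}_{2,2}^{\otimes d/2}$. (Thus the recursive
NAND function is sometimes known as the AND-OR tree.)

(b) Show that $\mathrm{DT}(\mathrm{NAND}^{\otimes d}) = 2^d$.

(c) Show that $\mathrm{RDT}(\mathrm{NAND}) = 2$.

(d) For $b \in \{\mathsf{True}, \mathsf{False}\}$ and $\mathcal{T}$ a randomized decision tree computing
a function $f$, let $\mathrm{RDT}_b(\mathcal{T})$ denote the maximum cost of $\mathcal{T}$ among
inputs $x$ with $f(x) = b$. Show that there is a randomized decision
tree $\mathcal{T}$ computing NAND with $\mathrm{RDT}_{\mathsf{True}}(\mathcal{T}) = 3/2$. (When the output is
False both inputs are True, so every tree must query both of them.)

(e) Show that $\mathrm{RDT}(\mathrm{NAND}^{\otimes 2}) \le 3$.

(f) Show that there is a family of randomized decision trees $(\mathcal{T}_d)_{d \in \mathbb{N}^+}$,
with $\mathcal{T}_d$ computing $\mathrm{NAND}^{\otimes d}$, satisfying the inequalities

$$\mathrm{RDT}_{\mathsf{False}}(\mathcal{T}_d) = 2\,\mathrm{RDT}_{\mathsf{True}}(\mathcal{T}_{d-1}),$$
$$\mathrm{RDT}_{\mathsf{True}}(\mathcal{T}_d) \le \mathrm{RDT}_{\mathsf{False}}(\mathcal{T}_{d-1}) + (1/2)\,\mathrm{RDT}_{\mathsf{True}}(\mathcal{T}_{d-1}).$$

(g) Deduce $\mathrm{RDT}(\mathrm{NAND}^{\otimes d}) \le O\bigl(\bigl(\tfrac{1+\sqrt{33}}{4}\bigr)^d\bigr) \approx n^{.754}$, where $n = 2^d$.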
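
The recurrence in Exercise 8.40(f) can be iterated directly to see the growth rate claimed in part (g). This is only an illustrative sketch: it treats the inequality for True outputs as an equality (so it tracks upper bounds), and it assumes the depth-0 tree is a single variable costing one query. The function name is hypothetical.

```python
import math


def nand_tree_costs(d: int):
    """Upper bounds (cost_false, cost_true) on the worst-case expected cost at depth d."""
    cost_false, cost_true = 1.0, 1.0  # depth 0: a single variable, one query
    for _ in range(d):
        # Output False: both children are True, so both must be evaluated.
        # Output True: evaluate children in random order; in the worst case
        # one child is False and the True child is evaluated first half the time.
        cost_false, cost_true = 2 * cost_true, cost_false + 0.5 * cost_true
    return cost_false, cost_true


# Largest root of lambda^2 = lambda/2 + 2, the characteristic equation of the recurrence.
growth = (1 + math.sqrt(33)) / 4

if __name__ == "__main__":
    for d in (1, 2, 10, 20):
        print(d, max(nand_tree_costs(d)), growth ** d)
    print("exponent of n:", math.log2(growth))
```

The ratio of successive costs converges to (1 + √33)/4, and its base-2 logarithm is the exponent of n = 2^d in part (g).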
8.41 Let $\mathcal{C} = \{\text{monotone } f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{DT}(f) \le k\}$. Show that $\mathcal{C}$
is learnable from random examples with error $\epsilon$ in time $n^{O(\sqrt{k}/\epsilon)}$. (Hint:
OS Inequality and Corollary 3.32.)

8.42 Verify that the decision tree process described in Definition 8.70 indeed
generates strings distributed according to $\pi^{\otimes n}$. (Hint: Induction on the
structure of the tree.)

8.43 Let $T$ be a deterministic decision tree of size $s$. Show that $\Delta(T) \le \log s$.
(Hint: Let $P$ be a random root-to-leaf path chosen as in the decision tree
process. How can you bound the entropy of the random variable $P$?)


8.44 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be a nonconstant function with range $\{-1, 1\}$.

(a) Show that $\mathbf{MaxInf}[f] \ge \mathbf{Var}[f]/\Delta^{(\pi)}(f)$ (cf. the KKL Theorem
from Chapter 4.2).

(b) In case $\Omega = \{-1, 1\}$ show that $\mathbf{MaxInf}[f] \ge \mathbf{Var}[f]/\deg(f)^3$. (You
should use the result of Midrijānis mentioned in the notes in Chapter 3.6.)

(c) Show that $\mathbf{I}[f] \ge \mathbf{Var}[f]/\delta^{(\pi)}(f)$, where $\delta^{(\pi)}(f)$ is the revealment
of $f$, defined in Exercise 8.37(b).

8.45 Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have range $\{-1, 1\}$.

(a) Let $\mathcal{T}$ be a randomized decision tree computing $f$ and let $i \in [n]$. Show
that $\mathbf{Inf}_i[f] \le \delta_i^{(\pi)}(\mathcal{T})$. (Hint: The decision tree process.)

(b) Suppose $f$ is transitive-symmetric. Show that $\Delta^{(\pi)}(f) \ge \sqrt{\mathbf{Var}[f] \cdot n}$.
(Hint: Exercise 8.37(b).) This result can be sharp up
to an $O(\sqrt{\log n})$ factor even for an $f : \{-1,1\}^n \to \{-1,1\}$ with
$\mathbf{Var}[f] = 1$; see (Benjamini et al., 2005).

8.46 In this exercise you will give an alternate proof of the OSSS Inequality that
is sharp when $\mathbf{Var}[f] = 1$ and is weaker by only a factor of 2 when $\mathbf{Var}[f]$
is small. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have range $\{-1, 1\}$. Given a randomized
decision tree $\mathcal{T}$ we write $\mathrm{err}(\mathcal{T}) = \mathbf{Pr}_{x \sim \pi^{\otimes n}}[\mathcal{T}(x) \ne f(x)]$.

(a) Let $T$ be a depth-$k$ deterministic decision tree (not necessarily computing
$f$) whose root queries coordinate $i$. Let $\mathcal{T}$ be the distribution
over deterministic trees of depth at most $k - 1$ given by following
a random outgoing edge from $T$’s root (according to $\pi$). Show that
$\mathrm{err}(\mathcal{T}) \le \mathrm{err}(T) + \tfrac12\mathbf{Inf}_i[f]$.

(b) Let $\mathcal{T}$ be a randomized decision tree of depth 0. Show that
$\mathrm{err}(\mathcal{T}) \ge \min\{\mathbf{Pr}[f(x) = 1], \mathbf{Pr}[f(x) = -1]\}$.

(c) Prove by induction on depth that if $\mathcal{T}$ is any randomized decision tree,
then

$$\tfrac12 \sum_{i=1}^n \delta_i^{(\pi)}(\mathcal{T}) \cdot \mathbf{Inf}_i[f] \ge \min\{\mathbf{Pr}[f(x) = 1], \mathbf{Pr}[f(x) = -1]\} - \mathrm{err}(\mathcal{T}).$$

Verify that this yields the OSSS Inequality when $\mathbf{Var}[f] = 1$ and in
general yields the OSSS Inequality up to a factor of 2.

Figure 8.2. The basis for a counterexample to the OSSS Inequality when
$f : \{-1,1\}^n \to \mathbb{R}$


8.47 Show that the OSSS Inequality fails for functions $f : \{-1,1\}^n \to \mathbb{R}$.
(Hint: The simplest counterexample uses a decision tree with the shape
in Fig. 8.2.)
Can you make the ratio of the left-hand side to the right-hand side 


130+20 V3 9 9 
equal to ~? Larger? 


Notes 


The origins of the orthogonal decomposition described in Section 8.3 date back to 
the work of Hoeffding (Hoeffding, 1948) (see also von Mises (von Mises, 1947)). 
Hoeffding’s work introduced U-statistics, i.e., functions $f$ of independent random
variables $X_1, \dots, X_n$ of the form $\mathrm{avg}_{1 \le i_1 < \cdots < i_k \le n}\, g(X_{i_1}, \dots, X_{i_k})$, where $g : \mathbb{R}^k \to \mathbb{R}$ is
a symmetric function. Such functions are themselves symmetric. For these functions,
Hoeffding introduced fS (which, by symmetry, depends only on |S]) and proved cer- 
tain inequalities (e.g., those in Exercise 8.22) relating Var[ f] to the quantities || f£°||3, 
II f= I|}. Nonsymmetric functions f were considered only rarely in the subsequent three 
decades of statistics research. One notable exception comes in the work of Hájek (Hájek,
1968), who effectively introduced $f^{\le 1}$, known as the Hájek projection of $f$. Also, a work
of Bourgain (Bourgain, 1979) essentially describes the decomposition $f = \sum_k f^{=k}$.
The first work that mentions the general orthogonal decomposition for not-necessarily- 
symmetric functions appears to be that of Efron and Stein (Efron and Stein, 1981) from 
the late 1970s. Efron and Stein’s description is brief; the subsequent work of Karlin and 
Rinott (Karlin and Rinott, 1982) gives a more thorough development. Efron and Stein’s 
main result was a proof of the statement $\mathbf{Var}[f] \le \mathbf{I}[f]$ for symmetric $f$; in the statistics
literature this is known as the Efron–Stein Inequality. Steele (Steele, 1986) extended this
to the case of nonsymmetric f by a simple proof that used the Fourier basis approach 
to orthogonal decomposition. This approach via Fourier bases originated in the work of 
Rubin and Vitale (Rubin and Vitale, 1980); see also Takemura (Takemura, 1983) and 
Vitale (Vitale, 1984). The terminology “Fourier basis” we use is not standard. 

The p-biased hypercube distribution is strongly motivated by the Erdős–Rényi (Erdős
and Rényi, 1959) theory of random graphs (see e.g., Bollobás and Riordan (Bollobás
and Riordan, 2008) for history) and by percolation theory (introduced in Broadbent and
Hammersley (Broadbent and Hammersley, 1957)). Influences under the p-biased
distribution, and their connection to threshold phenomena, were studied by Russo (Russo,
1981, 1982). The former work proved the Margulis–Russo formula independently of
Margulis, who had proven it earlier (Margulis, 1974). Fourier analysis under the
p-biased distribution seems to have been first introduced to the theoretical computer
science literature by Furst, Jackson, and Smith (Furst et al., 1991), who extended the LMN
learning algorithm for $\mathrm{AC}^0$ to this setting. Talagrand (Talagrand, 1993, 1994) developed
p-biased Fourier analysis for the study of threshold phenomena, strengthening Margulis and
Russo’s work and proving the KKL Theorem in the p-biased setting. Similar results
were obtained by Friedgut and Kalai (Friedgut and Kalai, 1996) using an earlier work
of Bourgain, Kahn, Kalai, Linial, and Katznelson (Bourgain et al., 1992) that proved a
version of the KKL Theorem in the setting of general product spaces. The statements
about sharp thresholds for cliques and connectivity in Example 8.49 are essentially due
to Matula and to Erdős–Rényi, respectively; see, e.g., Bollobás (Bollobás, 2001). Weak
threshold results similar to the ones in Exercise 8.29 were proved by Bollobás and
Thomason (Bollobás and Thomason, 1987), using the Kruskal–Katona Theorem rather
than the Poincaré Inequality.

Fourier analysis on finite abelian groups — and more generally, on locally compact 
abelian groups — is an enormous subject upon which we have touched only briefly. We 
cannot survey it here but refer instead to the standard textbook of Rudin (Rudin, 1962) 
and to the reader-friendly textbook of Terras (Terras, 1999), which focuses on finite 
groups. 

One of the earliest works on randomized decision tree complexity is that of Saks
and Wigderson (Saks and Wigderson, 1986); they proved the contents of Exercise 8.40.
(We note that RDT(f) is usually denoted R(f) in the literature, and DT(f) is usually
denoted D(f).) One basic lower bound in the area is that $\mathrm{RDT}(f) \ge \sqrt{\mathrm{DT}(f)}$
for any $f : \{-1,1\}^n \to \{-1,1\}$; in fact, this lower bound holds even for
“nondeterministic decision tree complexity”, as proved in (Blum and Impagliazzo, 1987;
Tardos, 1989). Yao’s Conjecture is also sometimes attributed to Richard Karp.
Regarding the recursive majority-of-3 function, Ravi Boppana was the first to point out
that $\mathrm{RDT}(\mathrm{Maj}_3^{\otimes d}) = o(3^d)$ even though $\mathrm{DT}(\mathrm{Maj}_3^{\otimes d}) = 3^d$. Saks and Wigderson noted
the bound $\mathrm{RDT}(\mathrm{Maj}_3^{\otimes d}) \le (8/3)^d$ and also that it is not optimal. Following
subsequent works (Jayram et al., 2003; Sherman, 2008) the best known upper bound is
$O(2.65^d)$ (Magniez et al., 2011) and the best known lower bound is $\Omega(2.55^d)$ (Leonardos,
2012).

The proof of the OSSS Inequality we presented is essentially due to Lee (Lee, 2010); 
the alternate proof from Exercise 8.46 is due to Jain and Zhang (Jain and Zhang, 2011). 
The Condorcet Jury Theorem is from (de Condorcet, 1785). The Shapley value described 
in Exercise 8.31 was introduced by the Nobelist Shapley (Shapley, 1953); for more, see 
Roth (Roth, 1988). Exercise 8.34 is from Blais, O’Donnell, and Wimmer (Blais et al., 
2010). Exercises 8.37(a) and 8.45 are from Benjamini, Schramm, and Wilson (Benjamini 
et al., 2005); the term “revealment” was introduced by Schramm and Steif (Schramm 
and Steif, 2010). Exercise 8.47 is from (O’Donnell et al., 2005). Related to this, it 
is extremely interesting to ask whether something like the result of Exercise 8.44(b)
holds for functions $f : \{-1,1\}^n \to [-1,1]$. It has been suggested that the answer is
yes:


Aaronson–Ambainis Conjecture. (Aaronson, 2008; Aaronson and Ambainis, 2011)
Let $f : \{-1,1\}^n \to [-1,1]$. Then $\mathbf{MaxInf}[f] \ge \mathrm{poly}(\mathbf{Var}[f]/\deg(f))$.

If true, this conjecture would have significant consequences regarding the limitations of
efficient quantum computation; see Aaronson and Ambainis (Aaronson and Ambainis,
2011). The best result in the direction of the conjecture is $\mathbf{MaxInf}[f] \ge \mathrm{poly}(\mathbf{Var}[f]/2^{\deg(f)})$,
due to Dinur et al. (Dinur et al., 2007).
Basics of Hypercontractivity 


In 1970, Bonami proved the following central result: 


The Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}$ and let
$1 \le p \le q \le \infty$. Then $\|\mathrm{T}_\rho f\|_q \le \|f\|_p$ for $0 \le \rho \le \sqrt{\tfrac{p-1}{q-1}}$.


As stated, this theorem may look somewhat opaque. In this chapter we 
consider some special cases of it that are easier to understand, easier to prove, 
and that encompass almost all of the theorem’s uses. The proof of the full 
theorem is deferred to Chapter 10. The special cases in this chapter are the 
following: 


Bonami Lemma. Let $f : \{-1,1\}^n \to \mathbb{R}$ have degree $k$. Then $\|f\|_4 \le \sqrt{3}^{\,k}\,\|f\|_2$.


The fundamental idea of this statement is that if $x \sim \{-1,1\}^n$ and
$f : \{-1,1\}^n \to \mathbb{R}$ has low degree then the random variable $f(x)$ is quite
“reasonable”; e.g., it is “nicely” distributed around its mean. The Bonami Lemma
has a very easy inductive proof and is already powerful enough to obtain many 
of the well-known applications of “hypercontractivity”, including the KKL 
Theorem (proven at the end of this chapter) and the Invariance Principle. 


(2, q)-Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $2 \le q \le \infty$.
Then $\|\mathrm{T}_{1/\sqrt{q-1}} f\|_q \le \|f\|_2$. As a consequence, if $f$ has degree at most $k$
then $\|f\|_q \le \sqrt{q-1}^{\,k}\,\|f\|_2$.


This theorem quantifies the extent to which $\mathrm{T}_\rho$ is a “smoothing” operator; 
equivalently, it gives even more control over the “reasonableness” of low- 
degree polynomials. Its consequences include a generalization of the Level-1 
Inequality (from Chapter 5.4) to “Level-k Inequalities”, as well as a Chernoff- 
like tail bound for low-degree polynomials of random bits. 
(p, 2)-Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}$ and let $1 \le p \le 2$.
Then $\|\mathrm{T}_{\sqrt{p-1}} f\|_2 \le \|f\|_p$. Equivalently, $\mathbf{Stab}_\rho[f] \le \|f\|_{1+\rho}^2$ for $0 \le \rho \le 1$.


This theorem is actually “equivalent” to the (2, q)-Hypercontractivity Theorem
by virtue of Hölder’s inequality. When specialized to the case of $f : \{-1,1\}^n \to \{0,1\}$
it gives a precise quantification of the fact that the “noisy hypercube
graph” is a “small-set expander”. Qualitatively, this means that if $A \subseteq \{-1,1\}^n$
is “small”, $x \sim A$, and $y \sim N_\rho(x)$, then $y$ is very unlikely to be in $A$.


9.1. Low-Degree Polynomials Are Reasonable 


As anyone who has worked in probability knows, a random variable can some- 
times behave in rather “unreasonable” ways. It may be never close to its expec- 
tation. It might exceed its expectation almost always, or almost never. It might 
have finite 1st, 2nd, and 3rd moments, but an infinite 4th moment. All of this 
poor behavior can cause a lot of trouble — wouldn’t it be nice to have a class of 
“reasonable” random variables? 

A very simple condition on a random variable that guarantees some good 
behavior is that its 4th moment is not too large compared to its 2nd moment. 


Definition 9.1. For a real number $B \ge 1$, we say that the real random variable
$X$ is B-reasonable if $\mathbf{E}[X^4] \le B\,\mathbf{E}[X^2]^2$. (Equivalently, if $\|X\|_4 \le B^{1/4}\|X\|_2$.)


The smaller $B$ is, the more “reasonable” $X$ is. This definition is scale-invariant
(i.e., $cX$ is B-reasonable if and only if $X$ is, for $c \ne 0$) but not
translation-invariant ($c + X$ and $X$ may not be equally reasonable). The latter
fact can sometimes be awkward, a point we’ll address further in Section 9.3.
Indeed, we’ll later encounter a few alternative conditions that also
capture “reasonableness”. For example, in Chapter 11 we’ll consider the analogous
3rd moment condition, $\mathbf{E}[|X|^3] \le B\,\mathbf{E}[X^2]^{3/2}$. Strictly speaking, the
4th moment condition is stronger: if $X$ is B-reasonable, then

$$\mathbf{E}[|X|^3] = \mathbf{E}[|X| \cdot X^2] \le \sqrt{\mathbf{E}[X^2]}\sqrt{\mathbf{E}[X^4]} \le \sqrt{B}\,\mathbf{E}[X^2]^{3/2};$$


on the other hand, there exist random variables with finite 3rd moment and 
infinite 4th moment. However, such unusual random variables almost never 
arise for us, and morally speaking the 4th and 3rd moment conditions are about 
equally good proxies for reasonableness. 


Example 9.2. If $x \sim \{-1, 1\}$ is uniformly random then $x$ is 1-reasonable.
If $g \sim N(0, 1)$ is a standard Gaussian, then $\mathbf{E}[g^4] = 3$, so $g$ is 3-reasonable.
If $u \sim [-1, 1]$ is uniform, then you can calculate that it is 9/5-reasonable. In
all of these examples $B$ is a “small” constant, and we think of these random
variables simply as “reasonable”. An example of an “unreasonable” random
variable would be a highly biased Bernoulli random variable; say, $\mathbf{Pr}[y = 1] = 2^{-n}$,
$\mathbf{Pr}[y = 0] = 1 - 2^{-n}$, where $n$ is large. This $y$ is not B-reasonable unless
$B \ge 2^n$.
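
The constants in Example 9.2 are just ratios of moments, and can be recomputed exactly. The sketch below is illustrative only; the helper name is hypothetical, and for the two continuous examples it plugs in the standard moments (E[g^2] = 1, E[g^4] = 3 for a standard Gaussian; E[u^2] = 1/3, E[u^4] = 1/5 for u uniform on [-1, 1]) rather than integrating.

```python
from fractions import Fraction


def reasonableness_constant(pmf):
    """Smallest B with E[X^4] <= B * E[X^2]^2, for a finite pmf [(value, prob), ...]."""
    second = sum(prob * value ** 2 for value, prob in pmf)
    fourth = sum(prob * value ** 4 for value, prob in pmf)
    return fourth / second ** 2


half = Fraction(1, 2)
uniform_bit = [(Fraction(-1), half), (Fraction(1), half)]

n = 10  # the biased Bernoulli example with Pr[y = 1] = 2^{-n}
biased = [(Fraction(1), Fraction(1, 2 ** n)), (Fraction(0), 1 - Fraction(1, 2 ** n))]

gaussian_B = Fraction(3) / Fraction(1) ** 2             # E[g^4] / E[g^2]^2
uniform_interval_B = Fraction(1, 5) / Fraction(1, 3) ** 2  # E[u^4] / E[u^2]^2

if __name__ == "__main__":
    print("uniform bit:", reasonableness_constant(uniform_bit))
    print("Gaussian:", gaussian_B)
    print("uniform on [-1,1]:", uniform_interval_B)
    print("biased Bernoulli:", reasonableness_constant(biased))
```

The last ratio is exactly 2^n, matching the claim that the biased Bernoulli variable is B-reasonable only for B at least 2^n; the uniform-interval ratio is 9/5, so that variable is in particular 2-reasonable.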


Let’s give a few illustrations of why reasonable random variables are nice 
to work with. First, they have slightly better tail bounds than what you would 
get out of the Chebyshev inequality: 


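Proposition 9.3. Let $X \not\equiv 0$ be B-reasonable. Then $\mathbf{Pr}[|X| \ge t\|X\|_2] \le B/t^4$
for all $t > 0$.

Proof. This is immediate from Markov’s inequality:

$$\mathbf{Pr}[|X| \ge t\|X\|_2] = \mathbf{Pr}[X^4 \ge t^4\|X\|_2^4] \le \frac{\mathbf{E}[X^4]}{t^4\,\mathbf{E}[X^2]^2} \le \frac{B}{t^4}.$$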

More interestingly, they also satisfy anticoncentration bounds; e.g., you can 
upper-bound the probability that they are near 0. 


Proposition 9.4. Let $X \not\equiv 0$ be B-reasonable. Then it holds that
$\mathbf{Pr}[|X| > t\|X\|_2] \ge (1 - t^2)^2/B$ for all $t \in [0, 1]$.

Proof. Applying the Paley–Zygmund inequality (also called the “second
moment method”) to $X^2$, we get

$$\mathbf{Pr}[|X| > t\|X\|_2] = \mathbf{Pr}[X^2 > t^2\,\mathbf{E}[X^2]] \ge (1 - t^2)^2\,\frac{\mathbf{E}[X^2]^2}{\mathbf{E}[X^4]} \ge \frac{(1 - t^2)^2}{B}.$$

For a generalization of this proposition, see Exercise 9.12. 
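
Both Proposition 9.3 and Proposition 9.4 can be checked exactly on a concrete random variable. The sketch below is an illustration under an assumed choice of X, namely the sum of n independent uniform ±1 bits, whose distribution is binomial; the function names are hypothetical.

```python
from math import comb, sqrt


def bit_sum_pmf(n):
    """Exact pmf of X = x_1 + ... + x_n for independent uniform +-1 bits."""
    return {2 * h - n: comb(n, h) / 2 ** n for h in range(n + 1)}


def moment(pmf, r):
    """E[|X|^r]."""
    return sum(prob * abs(value) ** r for value, prob in pmf.items())


def tail_report(pmf, t):
    """Return (B, Pr[|X| >= t||X||_2], B/t^4, Pr[|X| > t||X||_2], (1-t^2)^2/B).

    The last entry is only a valid lower bound for 0 <= t <= 1; t must be positive.
    """
    second, fourth = moment(pmf, 2), moment(pmf, 4)
    B = fourth / second ** 2
    threshold = t * sqrt(second)
    tail_ge = sum(prob for value, prob in pmf.items() if abs(value) >= threshold)
    tail_gt = sum(prob for value, prob in pmf.items() if abs(value) > threshold)
    return B, tail_ge, B / t ** 4, tail_gt, (1 - t * t) ** 2 / B


if __name__ == "__main__":
    pmf = bit_sum_pmf(20)
    for t in (0.25, 0.5, 1.0, 2.0, 3.0):
        print(t, tail_report(pmf, t))
```

For this X one has E[X^2] = n and E[X^4] = 3n^2 - 2n, so the computed B is 3 - 2/n, consistent with X being 3-reasonable.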
For a discrete random variable X, a simple condition that guarantees reason- 
ableness is that X takes on each of its values with nonnegligible probability: 


Proposition 9.5. Let $X$ be a discrete random variable with probability mass
function $\pi$. Write

$$\lambda = \min(\pi) = \min_{x \in \mathrm{range}(X)}\{\mathbf{Pr}[X = x]\}.$$

Then $X$ is $(1/\lambda)$-reasonable.

Proof. Let $M = \|X\|_\infty$. Since $\mathbf{Pr}[|X| = M] \ge \lambda$ we get

$$\mathbf{E}[X^2] \ge \lambda M^2 \implies M^2 \le \mathbf{E}[X^2]/\lambda.$$

On the other hand,

$$\mathbf{E}[X^4] = \mathbf{E}[X^2 \cdot X^2] \le M^2 \cdot \mathbf{E}[X^2],$$

and thus $\mathbf{E}[X^4] \le (1/\lambda)\,\mathbf{E}[X^2]^2$ as required.


The converse to Proposition 9.5 is certainly not true. For example, if
$X = \tfrac{1}{\sqrt{n}}x_1 + \cdots + \tfrac{1}{\sqrt{n}}x_n$ where $x \sim \{-1,1\}^n$, then $X$ is very close to a standard
Gaussian random variable (for $n$ large) and is, unsurprisingly, 3-reasonable. On
the other hand, the “$\lambda$” for this $X$ is tiny, $2^{-n}$.

This discussion raises the issue of how you might try to construct an unreasonable
random variable out of independent uniform ±1 bits. By Proposition 9.5,
at the very least you must use a lot of them. Furthermore, it also
seems that they must be combined in a high-degree way. For example, to construct
the unreasonable random variable $y$ from Example 9.2 requires degree $n$:
$y = (1 + x_1)(1 + x_2) \cdots (1 + x_n)/2^n$.

Indeed, the idea that high degree is required for unreasonableness is correct, 
as the following crucial result shows: 


The Bonami Lemma. For each $k$, if $f : \{-1,1\}^n \to \mathbb{R}$ has degree at most $k$
and $x_1, \dots, x_n$ are independent, uniformly random ±1 bits, then the random
variable $f(x)$ is $9^k$-reasonable, i.e.,

$$\mathbf{E}[f(x)^4] \le 9^k\,\mathbf{E}[f(x)^2]^2 \iff \|f\|_4 \le \sqrt{3}^{\,k}\,\|f\|_2.$$


In other words, low-degree polynomials of independent uniform ±1 bits are 
reasonable. As we will explain later, the Bonami Lemma is a special case of 
more general results in the theory of “hypercontractivity”. However, many key 
theorems using hypercontractivity — e.g., the KKL Theorem, the Invariance 
Principle — really need only the simple Bonami Lemma. (We should also note 
that the name “Bonami Lemma” is not standard; however, the result was first 
proved by Bonami and it’s often used as a lemma, so the name fits. See the 
discussion in the notes in Section 9.7.) 

One pleasant thing about the Bonami Lemma is that once you decide to 
prove it by induction on n, the proof practically writes itself. The only “non- 
automatic” step is an application of Cauchy—Schwarz. 


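Proof of the Bonami Lemma. We assume $k \ge 1$ as otherwise $f$ must be constant
and the claim is trivial. The proof is by induction on $n$. Again, if
$n = 0$, then $f$ must be constant and the claim is trivial. For $n \ge 1$ we
can use the decomposition $f(x) = x_n \mathrm{D}_n f(x) + \mathrm{E}_n f(x)$ (Proposition 2.24),
where $\deg(\mathrm{D}_n f) \le k - 1$, $\deg(\mathrm{E}_n f) \le k$, and the polynomials $\mathrm{D}_n f(x)$ and
$\mathrm{E}_n f(x)$ don’t depend on $x_n$. For brevity we write $f = f(x)$, $d = \mathrm{D}_n f(x)$, and
$e = \mathrm{E}_n f(x)$. Now

$$\mathbf{E}[f^4] = \mathbf{E}[(x_n d + e)^4]$$
$$= \mathbf{E}[x_n^4 d^4] + 4\,\mathbf{E}[x_n^3 d^3 e] + 6\,\mathbf{E}[x_n^2 d^2 e^2] + 4\,\mathbf{E}[x_n d e^3] + \mathbf{E}[e^4]$$
$$= \mathbf{E}[x_n^4]\,\mathbf{E}[d^4] + 4\,\mathbf{E}[x_n^3]\,\mathbf{E}[d^3 e] + 6\,\mathbf{E}[x_n^2]\,\mathbf{E}[d^2 e^2] + 4\,\mathbf{E}[x_n]\,\mathbf{E}[d e^3] + \mathbf{E}[e^4].$$

In the last step we used the fact that $x_n$ is independent of $d$ and $e$, since
$\mathrm{D}_n f$ and $\mathrm{E}_n f$ do not depend on $x_n$. We now use $\mathbf{E}[x_n] = \mathbf{E}[x_n^3] = 0$ and
$\mathbf{E}[x_n^2] = \mathbf{E}[x_n^4] = 1$ to deduce

$$\mathbf{E}[f^4] = \mathbf{E}[d^4] + 6\,\mathbf{E}[d^2 e^2] + \mathbf{E}[e^4]. \qquad (9.1)$$

A similar (and simpler) sequence of steps shows that

$$\mathbf{E}[f^2] = \mathbf{E}[d^2] + \mathbf{E}[e^2]. \qquad (9.2)$$

To upper-bound (9.1), recall that $d = \mathrm{D}_n f(x)$ where $\mathrm{D}_n f$ is a multilinear
polynomial of degree at most $k - 1$ depending on $n - 1$ variables. Thus we
can apply the induction hypothesis to deduce $\mathbf{E}[d^4] \le 9^{k-1}\,\mathbf{E}[d^2]^2$. Similarly,
$\mathbf{E}[e^4] \le 9^k\,\mathbf{E}[e^2]^2$ since $\deg(\mathrm{E}_n f) \le k$. To bound $\mathbf{E}[d^2 e^2]$ we apply
Cauchy–Schwarz, getting $\sqrt{\mathbf{E}[d^4]}\sqrt{\mathbf{E}[e^4]}$ and letting us use induction again. Thus we
have

$$\mathbf{E}[f^4] \le 9^{k-1}\,\mathbf{E}[d^2]^2 + 6\sqrt{9^{k-1}\,\mathbf{E}[d^2]^2}\sqrt{9^k\,\mathbf{E}[e^2]^2} + 9^k\,\mathbf{E}[e^2]^2$$
$$\le 9^k\bigl(\mathbf{E}[d^2]^2 + 2\,\mathbf{E}[d^2]\,\mathbf{E}[e^2] + \mathbf{E}[e^2]^2\bigr) = 9^k\bigl(\mathbf{E}[d^2] + \mathbf{E}[e^2]\bigr)^2,$$

where we used $9^{k-1}\,\mathbf{E}[d^2]^2 \le 9^k\,\mathbf{E}[d^2]^2$. In light of (9.2), this completes the
proof.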
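
For small n the Bonami Lemma can be verified by brute force, which is a useful check on the constants. The following sketch is illustrative only: it draws a multilinear polynomial of degree at most k with Gaussian coefficients (an arbitrary choice made here), enumerates all of {-1,1}^n, and compares E[f^4]/E[f^2]^2 against 9^k. All function names are hypothetical.

```python
import itertools
import random
from math import prod


def random_low_degree_poly(n, k, rng):
    """Random multilinear polynomial of degree <= k, as {monomial (tuple of indices): coefficient}."""
    return {S: rng.gauss(0, 1)
            for r in range(k + 1)
            for S in itertools.combinations(range(n), r)}


def evaluate(poly, x):
    """Evaluate the polynomial at a point x in {-1,1}^n (the empty monomial is 1)."""
    return sum(c * prod(x[i] for i in S) for S, c in poly.items())


def moment_ratio(poly, n):
    """E[f^4] / E[f^2]^2 for uniform x in {-1,1}^n, by full enumeration."""
    values = [evaluate(poly, x) for x in itertools.product((-1, 1), repeat=n)]
    second = sum(v ** 2 for v in values) / len(values)
    fourth = sum(v ** 4 for v in values) / len(values)
    return fourth / second ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    n = 5
    for k in (1, 2, 3):
        ratio = moment_ratio(random_low_degree_poly(n, k, rng), n)
        print(f"degree <= {k}: ratio {ratio:.3f}, bound {9 ** k}")
```

Random polynomials typically land far below 9^k; the examples that come closest to the bound are explored in the exercises cited after the proof.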


Some aspects of the sharpness of the Bonami Lemma are explored in
Exercises 9.2, 9.3, 9.37, and 9.38. Here we make one more observation. At the end
of the proof we used the wasteful-looking inequality $9^{k-1}\,\mathbf{E}[d^2]^2 \le 9^k\,\mathbf{E}[d^2]^2$.
Tracing back through the proof, it’s easy to see that it would still be valid even if
we just had $\mathbf{E}[x_n^4] \le 9$ rather than $\mathbf{E}[x_n^4] = 1$. For example, the Bonami Lemma
holds not just if the $x_i$’s are random bits, but if they are standard Gaussians, or
are uniform on $[-1, 1]$, or there are some of each. We leave the following as
Exercise 9.4.
Corollary 9.6. Let $x_1, \dots, x_n$ be independent, not necessarily identically
distributed, random variables satisfying $\mathbf{E}[x_i] = \mathbf{E}[x_i^3] = 0$. (This holds if, e.g.,
each $-x_i$ has the same distribution as $x_i$.) Assume also that each $x_i$ is
B-reasonable. Let $f = F(x_1, \dots, x_n)$, where $F$ is a multilinear polynomial of
degree at most $k$. Then $f$ is $\max(B, 9)^k$-reasonable.


As a first application of the Bonami Lemma, let us combine it with Propo- 
sition 9.4 to show that a low-degree function is not too concentrated around its 
mean: 


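Theorem 9.7. Let $f : \{-1,1\}^n \to \mathbb{R}$ be a nonconstant function of degree at
most $k$; write $\mu = \mathbf{E}[f]$ and $\sigma = \sqrt{\mathbf{Var}[f]}$. Then

$$\mathbf{Pr}_{x \sim \{-1,1\}^n}\bigl[|f(x) - \mu| > \tfrac12\sigma\bigr] \ge \tfrac{1}{16}\,9^{1-k}.$$

Proof. Let $g = \tfrac{1}{\sigma}(f - \mu)$, a function of degree at most $k$ satisfying $\|g\|_2 = 1$.
By the Bonami Lemma, $g$ is $9^k$-reasonable. The result now follows by applying
Proposition 9.4 to $g$ with $t = \tfrac12$.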


Using this theorem, we can give a short proof of the FKN Theorem from
Chapter 2.5: If $f : \{-1,1\}^n \to \{-1,1\}$ has $\mathbf{W}^1[f] = 1 - \delta$ then $f$ is
$O(\delta)$-close to $\pm\chi_i$ for some $i \in [n]$.


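Proof of the FKN Theorem. Write $\ell = f^{=1}$, so $\mathbf{E}[\ell^2] = 1 - \delta$ by assumption.
We may assume without loss of generality that $\delta \le \tfrac{1}{1600}$. The goal of the proof
is to show that $\mathbf{Var}[\ell^2]$ is small; specifically we’ll show that $\mathbf{Var}[\ell^2] \le 6400\delta$.
This will complete the proof because (using Exercise 1.20 for the first equality
below)

$$\tfrac12\mathbf{Var}[\ell^2] = \sum_{i \ne j} \widehat{f}(i)^2\widehat{f}(j)^2 = \Bigl(\sum_{i=1}^n \widehat{f}(i)^2\Bigr)^2 - \sum_{i=1}^n \widehat{f}(i)^4 = (1 - \delta)^2 - \sum_{i=1}^n \widehat{f}(i)^4 \ge (1 - 2\delta) - \sum_{i=1}^n \widehat{f}(i)^4$$

and hence $\mathbf{Var}[\ell^2] \le 6400\delta$ implies

$$1 - 6402\delta \le \sum_{i=1}^n \widehat{f}(i)^4 \le \max_i\{\widehat{f}(i)^2\} \cdot \sum_{i=1}^n \widehat{f}(i)^2 \le \max_i\{\widehat{f}(i)^2\} \le \max_i\{|\widehat{f}(i)|\},$$

as required.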
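To bound $\mathbf{Var}[\ell^2]$ we first apply Theorem 9.7 to the degree-2 function $\ell^2$,
whose mean is $\mathbf{E}[\ell^2] = 1 - \delta$; this yields

$$\mathbf{Pr}\Bigl[\bigl|\ell^2 - (1 - \delta)\bigr| \ge \tfrac12\sqrt{\mathbf{Var}[\ell^2]}\Bigr] \ge \tfrac{1}{16}\,9^{1-2} = \tfrac{1}{144}.$$

Now suppose by way of contradiction that $\mathbf{Var}[\ell^2] > 6400\delta$; then the above
implies

$$\tfrac{1}{144} \le \mathbf{Pr}\bigl[|\ell^2 - (1 - \delta)| \ge 40\sqrt{\delta}\bigr] \le \mathbf{Pr}\bigl[|\ell^2 - 1| \ge 39\sqrt{\delta}\bigr]. \qquad (9.3)$$

This says that $|\ell|$ is frequently far from 1. Since $|f| = 1$ always, we can
deduce that $|f - \ell|^2$ is frequently large. More precisely, a short calculation
(Exercise 9.5) shows that $(f - \ell)^2 \ge 169\delta$ whenever $|\ell^2 - 1| \ge 39\sqrt{\delta}$.
But now (9.3) implies $\mathbf{E}[(f - \ell)^2] \ge \tfrac{1}{144} \cdot 169\delta > \delta$, a contradiction since
$\mathbf{E}[(f - \ell)^2] = 1 - \mathbf{W}^1[f] = \delta$ by assumption.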


9.2. Small Subsets of the Hypercube Are Noise-Sensitive 


An immediate consequence of the Bonami Lemma is that for any
$f : \{-1,1\}^n \to \mathbb{R}$ and $k \in \mathbb{N}$,

$$\|\mathrm{T}_{1/\sqrt{3}} f^{=k}\|_4 = \tfrac{1}{\sqrt{3}^{\,k}}\|f^{=k}\|_4 \le \|f^{=k}\|_2. \qquad (9.4)$$

This is a special case of the (2, 4)-Hypercontractivity Theorem (whose name
will be explained shortly), which says that the assumption of degree-k
homogeneity is not necessary:

(2, 4)-Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}$. Then

$$\|\mathrm{T}_{1/\sqrt{3}} f\|_4 \le \|f\|_2.$$


It almost looks as though you could prove this theorem simply by sum- 
ming (9.4) over k. In fact that proof strategy can be made to work given a 
few extra tricks (see Exercise 9.6), but it’s just as easy to repeat the induction 
technique used for the Bonami Lemma. 


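Proof. We’ll prove $\mathbf{E}[\mathrm{T}_{1/\sqrt{3}} f(x)^4] \le \mathbf{E}[f(x)^2]^2$ using the same induction as
in the Bonami Lemma. Retaining the notation $d$ and $e$, and using the shorthand
$\mathrm{T} = \mathrm{T}_{1/\sqrt{3}}$, we have

$$\mathrm{T}f = x_n \cdot \tfrac{1}{\sqrt{3}}\mathrm{T}d + \mathrm{T}e.$$

Similar computations to those in the Bonami Lemma proof yield

$$\mathbf{E}[(\mathrm{T}f)^4] = \bigl(\tfrac{1}{\sqrt{3}}\bigr)^4\,\mathbf{E}[(\mathrm{T}d)^4] + 6\bigl(\tfrac{1}{\sqrt{3}}\bigr)^2\,\mathbf{E}[(\mathrm{T}d)^2(\mathrm{T}e)^2] + \mathbf{E}[(\mathrm{T}e)^4]$$
$$\le \mathbf{E}[(\mathrm{T}d)^4] + 2\,\mathbf{E}[(\mathrm{T}d)^2(\mathrm{T}e)^2] + \mathbf{E}[(\mathrm{T}e)^4]$$
$$\le \mathbf{E}[(\mathrm{T}d)^4] + 2\sqrt{\mathbf{E}[(\mathrm{T}d)^4]}\sqrt{\mathbf{E}[(\mathrm{T}e)^4]} + \mathbf{E}[(\mathrm{T}e)^4]$$
$$\le \mathbf{E}[d^2]^2 + 2\,\mathbf{E}[d^2]\,\mathbf{E}[e^2] + \mathbf{E}[e^2]^2$$
$$= \bigl(\mathbf{E}[d^2] + \mathbf{E}[e^2]\bigr)^2 = \mathbf{E}[f^2]^2,$$

where the second inequality is Cauchy–Schwarz, the third is induction, and the
final equality is a simple computation analogous to (9.2).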
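
The (2, 4)-Hypercontractivity Theorem is also easy to test numerically on a small cube. In the sketch below (an illustration, with hypothetical function names) a function is stored by its Fourier coefficients, the noise operator with parameter rho scales the coefficient on S by rho^{|S|}, and norms are computed by enumerating all inputs. Random Gaussian coefficients are an arbitrary choice made here.

```python
import itertools
import random
from math import prod, sqrt


def random_function(n, rng):
    """Random f : {-1,1}^n -> R, stored as Fourier coefficients {S: f_hat(S)}."""
    return {S: rng.gauss(0, 1)
            for r in range(n + 1)
            for S in itertools.combinations(range(n), r)}


def noise_operator(coeffs, rho):
    """T_rho multiplies the Fourier coefficient on S by rho^{|S|}."""
    return {S: rho ** len(S) * c for S, c in coeffs.items()}


def norm(coeffs, n, q):
    """||f||_q = E[|f(x)|^q]^{1/q} under uniform x, by full enumeration."""
    values = [sum(c * prod(x[i] for i in S) for S, c in coeffs.items())
              for x in itertools.product((-1, 1), repeat=n)]
    return (sum(abs(v) ** q for v in values) / len(values)) ** (1 / q)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    f = random_function(n, rng)
    print(norm(noise_operator(f, 1 / sqrt(3)), n, 4), "<=", norm(f, n, 2))
    # With too little noise the inequality can fail, e.g. for f(x) = 1 + x_1:
    g = {(): 1.0, (0,): 1.0}
    print(norm(noise_operator(g, 0.9), 1, 4), "vs", norm(g, 1, 2))
```

The second printout illustrates that some smoothing is needed: for f(x) = 1 + x_1 and rho = 0.9 the 4-norm of the smoothed function exceeds the 2-norm of f.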


The name “hypercontractivity” in this theorem describes the fact that not
only is $\mathrm{T}_{1/\sqrt{3}}$ a “contraction” on $L^2(\{-1,1\}^n)$, meaning $\|\mathrm{T}_{1/\sqrt{3}} f\|_2 \le \|f\|_2$ for
all $f$ (Exercise 2.33), it’s even a contraction when viewed as an operator from
$L^2(\{-1,1\}^n)$ to $L^4(\{-1,1\}^n)$. You should think of hypercontractivity theorems
as quantifying the extent to which $\mathrm{T}_\rho$ is a “smoothing”, or “reasonable-izing”
operator.

Unfortunately the quantity $\|\mathrm{T}_{1/\sqrt{3}} f\|_4$ in the (2, 4)-Hypercontractivity Theorem
does not have an obvious combinatorial meaning. On the other hand, the
quantity

$$\|\mathrm{T}_{1/\sqrt{3}} f\|_2 = \sqrt{\langle \mathrm{T}_{1/\sqrt{3}} f, \mathrm{T}_{1/\sqrt{3}} f\rangle} = \sqrt{\langle f, \mathrm{T}_{1/\sqrt{3}}\mathrm{T}_{1/\sqrt{3}} f\rangle} = \sqrt{\mathbf{Stab}_{1/3}[f]}$$


does have a nice combinatorial meaning. And we can make this quantity appear
in the Hypercontractivity Theorem via a simple trick from analysis, just using
the fact that $\mathrm{T}_{1/\sqrt{3}}$ is a self-adjoint operator. We “flip the norms across 2” using
Hölder’s inequality:

(4/3, 2)-Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}$. Then

$$\|\mathrm{T}_{1/\sqrt{3}} f\|_2 \le \|f\|_{4/3};$$

i.e.,

$$\mathbf{Stab}_{1/3}[f] \le \|f\|_{4/3}^2. \qquad (9.5)$$

Proof. Writing $\mathrm{T} = \mathrm{T}_{1/\sqrt{3}}$ for brevity we have

$$\|\mathrm{T}f\|_2^2 = \langle \mathrm{T}f, \mathrm{T}f\rangle = \langle f, \mathrm{T}\mathrm{T}f\rangle \le \|f\|_{4/3}\|\mathrm{T}\mathrm{T}f\|_4 \le \|f\|_{4/3}\|\mathrm{T}f\|_2 \qquad (9.6)$$

by Hölder’s inequality and the (2, 4)-Hypercontractivity Theorem. Dividing
through by $\|\mathrm{T}f\|_2$ (which we may assume is nonzero) completes the proof.
In the inequality (9.5) the left-hand side is a natural quantity. The right-hand
side is just 1 when $f : \{-1,1\}^n \to \{-1,1\}$, which is not very interesting. But
if we instead look at $f : \{-1,1\}^n \to \{0,1\}$ we get something very interesting:

Corollary 9.8. Let $A \subseteq \{-1,1\}^n$ have volume $\alpha$; i.e., let $1_A : \{-1,1\}^n \to \{0,1\}$
satisfy $\mathbf{E}[1_A] = \alpha$. Then

$$\mathbf{Stab}_{1/3}[1_A] = \mathbf{Pr}_{x \sim \{-1,1\}^n,\ y \sim N_{1/3}(x)}[x \in A, y \in A] \le \alpha^{3/2}.$$

Equivalently (for $\alpha > 0$),

$$\mathbf{Pr}_{x \sim A,\ y \sim N_{1/3}(x)}[y \in A] \le \alpha^{1/2}.$$

Proof. This is immediate from inequality (9.5), since

$$\|1_A\|_{4/3}^2 = \bigl(\mathbf{E}[|1_A|^{4/3}]^{3/4}\bigr)^2 = \mathbf{E}[1_A]^{3/2} = \alpha^{3/2}.$$

See Section 9.5 for the generalization of this corollary to noise rates other 
than 1/3. 


Example 9.9. Assume $\alpha = 2^{-k}$, $k \in \mathbb{N}^+$, and $A$ is a subcube of codimension $k$;
e.g., $1_A : \mathbb{F}_2^n \to \{0, 1\}$ is the logical AND function on the first $k$ coordinates.
For every $x \in A$, when we form $y \sim N_{1/3}(x)$ we’ll have $y \in A$ if and only if the
first $k$ coordinates of $x$ do not change, which happens with probability
$(2/3)^k = (2/3)^{\log(1/\alpha)} = \alpha^{\log(3/2)} \approx \alpha^{.585} \le \alpha^{1/2}$. In fact, the bound $\alpha^{1/2}$ in Corollary 9.8
is essentially sharp when $A$ is a Hamming ball; see Exercise 9.24.
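
The calculation in Example 9.9 can be reproduced by brute force for small n. The sketch below is illustrative, with hypothetical function names; it uses the fact that a 1/3-correlated copy of x is obtained by flipping each coordinate independently with probability (1 - 1/3)/2 = 1/3, and it fixes the first k coordinates to -1 as one arbitrary way of realizing a codimension-k subcube.

```python
import itertools


def stay_probability(A, n, flip=1 / 3):
    """Pr[y in A] when x is uniform on A and y flips each coordinate of x w.p. `flip`."""
    A = list(A)
    total = 0.0
    for x in A:
        for y in A:
            dist = sum(a != b for a, b in zip(x, y))  # Hamming distance
            total += flip ** dist * (1 - flip) ** (n - dist)
    return total / len(A)


def subcube(n, k):
    """Points of {-1,1}^n whose first k coordinates are all -1 (codimension k)."""
    return [x for x in itertools.product((-1, 1), repeat=n)
            if all(v == -1 for v in x[:k])]


def hamming_ball(n, radius):
    """Points of {-1,1}^n with at most `radius` coordinates equal to -1."""
    return [x for x in itertools.product((-1, 1), repeat=n)
            if sum(v == -1 for v in x) <= radius]


if __name__ == "__main__":
    n, k = 6, 3
    alpha = 2 ** -k
    print("subcube:", stay_probability(subcube(n, k), n), (2 / 3) ** k, alpha ** 0.5)
    ball = hamming_ball(n, 1)
    print("ball:", stay_probability(ball, n), (len(ball) / 2 ** n) ** 0.5)
```

For the subcube the computed probability equals (2/3)^k, and for both sets it stays below the square root of the volume, as Corollary 9.8 requires.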


We can phrase Corollary 9.8 in terms of the expansion in a certain graph: 


Definition 9.10. For $n \in \mathbb{N}^+$ and $\rho \in [-1, 1]$, the n-dimensional $\rho$-stable
hypercube graph is the edge-weighted, complete directed graph on vertex set
$\{-1,1\}^n$ in which the weight on directed edge $(x, y) \in \{-1,1\}^n \times \{-1,1\}^n$ is
equal to $\mathbf{Pr}[(\boldsymbol{x}, \boldsymbol{y}) = (x, y)]$ when $(\boldsymbol{x}, \boldsymbol{y})$ is a $\rho$-correlated pair. If $\rho = 1 - 2\delta$
for $\delta \in [0, 1]$, we also call this the $\delta$-noisy hypercube graph. Here the weight
on $(x, y)$ is $\mathbf{Pr}[(\boldsymbol{x}, \boldsymbol{y}) = (x, y)]$ where $\boldsymbol{x} \sim \{-1,1\}^n$ is uniform and $\boldsymbol{y}$ is formed
from $\boldsymbol{x}$ by negating each coordinate independently with probability $\delta$.


Remark 9.11. The edge weights in this graph are nonnegative and sum to 1.
The graph is also “regular” in the sense that for each $x \in \{-1,1\}^n$ the sum of
all the edge weight leaving (or entering) $x$ is $2^{-n}$. You can also consider the
graph to be undirected, since the weight on $(x, y)$ is the same as the weight
on $(y, x)$; in this viewpoint, the weight on the undirected edge $(x, y)$ would
be $2^{1-n}\delta^{\Delta(x,y)}(1 - \delta)^{n - \Delta(x,y)}$. In fact, the graph is perhaps best thought of as
the discrete-time Markov chain on state space $\{-1,1\}^n$ in which a step from
state $x \in \{-1,1\}^n$ consists of moving to state $y \sim N_\rho(x)$. This is a reversible
chain with the uniform stationary distribution. Each discrete step is equivalent
to running the “usual” continuous-time Markov chain on the hypercube for
time $t = \ln(1/\rho)$ (assuming $\rho \in [0, 1]$).


With this definition in place, we can see Corollary 9.8 as saying that the
1/3-stable (equivalently, 1/3-noisy) hypercube graph is a “small-set expander”:
given any small $\alpha$-fraction of the vertices $A$, almost all of the edge weight
touching $A$ is on its boundary. More precisely, if we choose a random vertex
$x \in A$ and take a random edge out of $x$ (with probability proportional to its
edge weight), we end up outside $A$ with probability at least $1 - \alpha^{1/2}$. You
can compare this with the discussion surrounding the Level-1 Inequality in
Section 5.4, which is the analogous statement for the $\rho$-stable hypercube graph
“in the limit $\rho \to 0^+$”. The appropriate statement for general $\rho$ appears in
Section 9.5 as the “Small-Set Expansion Theorem”.

Corollary 9.8 would apply equally well if $1_A$ were replaced by a function
$g : \{-1,1\}^n \to \{-1, 0, 1\}$, with $\alpha$ denoting $\mathbf{Pr}[g \ne 0] = \mathbf{E}[|g|] = \mathbf{E}[g^2]$.
This situation occurs naturally when $g = \mathrm{D}_i f$ for some Boolean-valued
$f : \{-1,1\}^n \to \{-1,1\}$. In this case $\mathbf{Stab}_{1/3}[g] = \mathbf{Inf}_i^{(1/3)}[f]$, the 1/3-stable
influence of $i$ on $f$. We conclude that for a Boolean-valued function, if the
influence of $i$ is small then its 1/3-stable influence is much smaller:

Corollary 9.12. Let $f : \{-1,1\}^n \to \{-1,1\}$. Then $\mathbf{Inf}_i^{(1/3)}[f] \le \mathbf{Inf}_i[f]^{3/2}$
for all $i$.


We remark that the famous KKL Theorem (stated in Chapter 4.2) more or 
less follows by summing the above inequality over i € [n]; if you’re impatient 
to see its proof you can skip directly to Section 9.6 now. 


Let’s take one more look at the “small-set expansion result”, Corollary 9.8.
Since noise stability roughly measures how “low” a function’s Fourier weight is,
this corollary implies that a function $f : \{-1,1\}^n \to \{0,1\}$ with small mean $\alpha$
cannot have much of its Fourier weight at low degree. More precisely, for any
$k \in \mathbb{N}$ we have

$$\alpha^{3/2} \ge \mathbf{Stab}_{1/3}[f] \ge (1/3)^k\,\mathbf{W}^{\le k}[f] \implies \mathbf{W}^{\le k}[f] \le 3^k\alpha^{3/2}. \qquad (9.7)$$

For $k = 1$ this gives $\mathbf{W}^{\le 1}[f] \le 3\alpha^{3/2}$, which is nontrivial but not as strong
as the Level-1 Inequality from Section 5.4. But (9.7) also gives us “level-k
inequalities” for larger values of $k$. For example,

$$\mathbf{W}^{\le .25\log(1/\alpha)}[f] \le \alpha^{-.25\log 3 + 3/2} \le \alpha^{1.1} \ll \alpha = \|f\|_2^2;$$

i.e., almost all of $f$’s Fourier weight is above degree $.25\log(1/\alpha)$. We will give
slightly improved versions of these level-k inequalities in Section 9.5.


9.3. (2, q)- and (p, 2)-Hypercontractivity for a Single Bit 


Although you can get a lot of mileage out of studying the 4-norm of random 
variables, it’s also natural to consider other norms. For example, we would get 
improved versions of our concentration and anticoncentration results, Proposi- 
tions 9.3 and 9.4, if we could bound the higher norms of a random variable in 
terms of its 2-norm. As we’ll see, we can also get stronger “level-k inequalities” 
by bounding the (2 + €)-norm of a Boolean function for small € > 0. 

We started with the 4-norm due to the simplicity of the proofs of the Bonami 
Lemma and the (2, 4)-Hypercontractivity Theorem. To generalize these results 
to other norms it’s a bit more elegant to work with the latter. Partly this is 
because it’s “formally stronger” (see Theorem 9.21). But the main reason is 
that the hypercontractivity version alleviates the inelegant issue that being 
“B-reasonable” is not translation-invariant. Thus instead of generalizing the 
condition that $\|\rho X\|_4 \le \|X\|_2$ (“$X$ is $\rho^{-4}$-reasonable”) we’ll generalize the
condition that $\|a + \rho bX\|_4 \le \|a + bX\|_2$ (cf. the $n = 1$ case of the
(2, 4)-Hypercontractivity Theorem).


Definition 9.13. Let $1 \le p \le q \le \infty$ and let $0 \le \rho < 1$. We say that a real
random variable $X$ (with $\|X\|_q < \infty$) is $(p, q, \rho)$-hypercontractive if

$$\|a + \rho bX\|_q \le \|a + bX\|_p \quad \text{for all constants } a, b \in \mathbb{R}.$$

Remark 9.14. By homogeneity, it suffices to check the condition for $a = 1$,
$b \in \mathbb{R}$ or for $a \in \mathbb{R}$, $b = 1$ (cf. Exercise 9.9(a)). It’s also true (Exercise 9.11)
that if $X$ is $(p, q, \rho)$-hypercontractive then it is $(p, q, \rho')$-hypercontractive for
$\rho' < \rho$ as well.


In Exercise 9.10 you will show that if X is hypercontractive then E[X] 
must be 0. Thus hypercontractivity, like reasonableness, is not a translation- 
invariant notion. Nevertheless, the fact that the definition involves translation 
by an arbitrary a greatly facilitates proofs by induction. For example, an elegant 
property we gain from the definition is the following (Exercise 10.2): 
Proposition 9.15. Let $X$ and $Y$ be independent $(p, q, \rho)$-hypercontractive
random variables. Then $X + Y$ is also $(p, q, \rho)$-hypercontractive.

The $n = 1$ case of our (2, 4)-Hypercontractivity Theorem precisely says
that a single uniformly random ±1 bit $x$ is $(2, 4, 1/\sqrt{3})$-hypercontractive;
the (4/3, 2)-Hypercontractivity Theorem says that $x$ is also
$(4/3, 2, 1/\sqrt{3})$-hypercontractive. We’ll spend the remainder of this section generalizing these
facts to $(2, q, \rho)$- and $(p, 2, \rho)$-hypercontractivity for other values of $p$ and $q$.
We remark that in our study of hypercontractivity we’ll focus mainly on the
cases of $p = 2$ or $q = 2$. The study of hypercontractivity for $p, q \ne 2$ and for
random variables other than uniform ±1 bits is deferred to Chapter 10.

We now consider hypercontractivity of a uniformly random $\pm 1$ bit $x$. We
know that $x$ is $(2, q, 1/\sqrt{3})$-hypercontractive for $q = 4$; what about other values
of $q$? Things are most pleasant when $q$ is an even integer because then you
don’t need to take the absolute value when computing $\|a + \rho b x\|_q$. So let’s
try $q = 6$.


Proposition 9.16. For $x$ a uniform $\pm 1$ bit, we have $\|a + \rho b x\|_6 \le \|a + bx\|_2$
for all $a, b \in \mathbb{R}$ if (and only if) $\rho \le 1/\sqrt{5}$. That is, $x$ is $(2, 6, 1/\sqrt{5})$-
hypercontractive.

Proof. Raising the inequality to the 6th power, we need to show

$$\mathbf{E}[(a + \rho b x)^6] \le \mathbf{E}[(a + bx)^2]^3. \tag{9.8}$$

The result is trivial when $a = 0$; otherwise, we may assume $a = 1$ by homogeneity.
We expand both quantities inside expectations and use the fact that
$\mathbf{E}[x^k]$ is 0 when $k$ is odd and 1 when $k$ is even. Thus (9.8) is equivalent to

$$1 + 15\rho^2 b^2 + 15\rho^4 b^4 + \rho^6 b^6 \le (1 + b^2)^3 = 1 + 3b^2 + 3b^4 + b^6. \tag{9.9}$$

Comparing the two sides term-by-term we see that the coefficient on $b^2$ is
the limiting factor: in order for (9.9) to hold for all $b \in \mathbb{R}$ it is sufficient that
$15\rho^2 \le 3$; i.e., $\rho \le 1/\sqrt{5}$. By considering $b \to 0$ it’s also easy to see that this
condition is necessary.
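
The threshold in Proposition 9.16 can be sanity-checked numerically. The sketch below is only an illustration, not part of the proof: the helper names `lhs_rhs` and `holds_on_grid`, the grid of $b$ values, and the tolerance are all assumptions made for this example. It fixes $a = 1$ and compares the two norms on a finite grid.

```python
import math

def lhs_rhs(rho, b, q=6):
    """Return (||1 + rho*b*x||_q, ||1 + b*x||_2) for a uniform +/-1 bit x."""
    lhs = (0.5 * abs(1 + rho * b) ** q + 0.5 * abs(1 - rho * b) ** q) ** (1 / q)
    rhs = math.sqrt(1 + b * b)
    return lhs, rhs

def holds_on_grid(rho, q=6, tol=1e-12):
    """Check the inequality with a = 1 on a grid of b values in [-10, 10].

    The grid and tolerance are illustrative choices (assumptions), so a True
    result is evidence, not a proof.
    """
    for i in range(-10000, 10001):
        b = i / 1000
        lhs, rhs = lhs_rhs(rho, b, q)
        if lhs > rhs + tol:
            return False
    return True

# rho = 1/sqrt(5) is the threshold from Proposition 9.16; 0.46 lies just above it.
print(holds_on_grid(1 / math.sqrt(5)))
print(holds_on_grid(0.46))
```

Values of $\rho$ slightly above $1/\sqrt{5}$ fail for small $b$, which matches the remark that the $b^2$ coefficient is the limiting factor.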


If you repeat this analysis for the case of $q = 8$ you’ll find that again the
limiting factor is the coefficient on $b^2$, and that $x$ is $(2, 8, \rho)$-hypercontractive
if (and only if) $\binom{8}{2}\rho^2 \le \binom{4}{1}$; i.e., $\rho \le 1/\sqrt{7}$. In light of this it is natural to guess
that the following is true:


Theorem 9.17. Let $x$ be a uniform $\pm 1$ bit and let $q \in (2, \infty]$. Then
$\|a + \rho b x\|_q \le \|a + bx\|_2$ for all $a, b \in \mathbb{R}$ assuming $\rho \le 1/\sqrt{q-1}$.

Equivalent statements are that $\|a + (1/\sqrt{q-1})\,bx\|_q^2 \le a^2 + b^2$, that $x$ is
$(2, q, 1/\sqrt{q-1})$-hypercontractive, and that $\|T_{1/\sqrt{q-1}} f\|_q \le \|f\|_2$ holds for
any $f : \{-1, 1\} \to \mathbb{R}$.


For $q$ an even integer it is not hard (see Exercise 9.36) to prove Theorem 9.17
just as we did for $q = 6$. Indeed, the proof works even under more
general moment conditions on $x$, as in Corollary 9.6. Unfortunately, obtaining
Theorem 9.17 for all real $q > 2$ takes some more tricks. A natural idea
is to try forging ahead as in Proposition 9.16, using the series expansions for
$(1 + \rho b x)^q$ and $(1 + b^2)^{q/2}$ provided by the Generalized Binomial Theorem.
However, even when $|b| < 1$ (so that convergence is not an issue) there is a
difficulty because the coefficients in the expansion of $(1 + b^2)^{q/2}$ are sometimes
negative.

Luckily, this issue of negative coefficients in the series expansion goes away
if you try to prove the analogous $(p, 2, \rho)$-hypercontractivity statement. Thus
the slick proof of Theorem 9.17 proceeds by first proving that statement, then
“flipping the norms across 2”.


Theorem 9.18. Let $x$ be a uniform $\pm 1$ bit and let $1 \le p \le 2$. Then
$\|a + \rho b x\|_2 \le \|a + bx\|_p$ for all $a, b \in \mathbb{R}$ assuming $0 \le \rho \le \sqrt{p-1}$. That is, $x$ is
$(p, 2, \sqrt{p-1})$-hypercontractive.


Proof. By Remark 9.14 we may assume $a = 1$ and $\rho = \sqrt{p-1}$. By Exercise 9.7
we may also assume without loss of generality that $1 + bx \ge 0$ for
$x \in \{-1, 1\}$; i.e., that $|b| \le 1$. It then suffices to prove the result for all $|b| < 1$
because the $|b| = 1$ case follows by continuity. Writing $b = \epsilon$ for the sake of
intuition, we need to show

$$\|1 + \sqrt{p-1}\,\epsilon x\|_2^p \le \|1 + \epsilon x\|_p^p
\iff \mathbf{E}[(1 + \sqrt{p-1}\,\epsilon x)^2]^{p/2} \le \mathbf{E}[(1 + \epsilon x)^p]. \tag{9.10}$$

Here we were able to drop the absolute value on the right-hand side because
$|\epsilon| < 1$. The left-hand side of (9.10) is

$$(1 + (p-1)\epsilon^2)^{p/2} \le 1 + \tfrac{p(p-1)}{2}\epsilon^2, \tag{9.11}$$

where we used the inequality $(1 + t)^\theta \le 1 + \theta t$ for $t \ge 0$ and $0 \le \theta \le 1$ (easily
proved by comparing derivatives in $t$). As for the right-hand side of (9.10), since
$|\epsilon x| < 1$ we may use the Generalized Binomial Theorem to show it equals

$$\mathbf{E}\Bigl[1 + p\epsilon x + \tfrac{p(p-1)}{2!}\epsilon^2 x^2 + \tfrac{p(p-1)(p-2)}{3!}\epsilon^3 x^3 + \tfrac{p(p-1)(p-2)(p-3)}{4!}\epsilon^4 x^4 + \cdots\Bigr]$$
$$= 1 + p\epsilon\,\mathbf{E}[x] + \tfrac{p(p-1)}{2!}\epsilon^2\,\mathbf{E}[x^2] + \tfrac{p(p-1)(p-2)}{3!}\epsilon^3\,\mathbf{E}[x^3] + \tfrac{p(p-1)(p-2)(p-3)}{4!}\epsilon^4\,\mathbf{E}[x^4] + \cdots$$
$$= 1 + \tfrac{p(p-1)}{2}\epsilon^2 + \tfrac{p(p-1)(p-2)(p-3)}{4!}\epsilon^4 + \tfrac{p(p-1)(p-2)(p-3)(p-4)(p-5)}{6!}\epsilon^6 + \cdots.$$

In light of (9.11), to verify (9.10) it suffices to note that each “post-quadratic”
term above,

$$\frac{p(p-1)(p-2)(p-3)\cdots(p-(2k-1))}{(2k)!}\,\epsilon^{2k},$$

is nonnegative. This follows from $1 \le p \le 2$: the numerator has two positive
factors and an even number of negative factors.
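
As a numerical illustration of Theorem 9.18 (a sketch, not a substitute for the proof), the code below evaluates both norms of $1 + c\,x$ for a uniform $\pm 1$ bit and reports the largest violation over a grid of $b$. The function names `bit_norm` and `max_violation`, the grid range $[-5, 5]$, and the step count are assumptions chosen for this example.

```python
import math

def bit_norm(a, c, r):
    """L^r norm of a + c*x for a uniform +/-1 bit x."""
    return (0.5 * abs(a + c) ** r + 0.5 * abs(a - c) ** r) ** (1 / r)

def max_violation(p, rho, steps=2000):
    """Largest value of ||1 + rho*b*x||_2 - ||1 + b*x||_p over a grid of b in [-5, 5].

    A value <= 0 (up to rounding) means the (p, 2, rho) inequality held on the grid.
    """
    worst = -math.inf
    for i in range(-steps, steps + 1):
        b = 5 * i / steps
        worst = max(worst, bit_norm(1, rho * b, 2) - bit_norm(1, b, p))
    return worst

for p in (1.1, 1.5, 1.9):
    print(p, max_violation(p, math.sqrt(p - 1)))
```

With $\rho = \sqrt{p-1}$ the reported maximum is attained at $b = 0$, where both sides equal 1; a larger $\rho$ produces a strictly positive violation for small $b$.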


To deduce Theorem 9.17 from Theorem 9.18 we again just need to flip the
norms across 2 using the fact that $T_\rho$ is self-adjoint. This is accomplished by
taking $\Omega = \{-1, 1\}$, $\pi = \pi_{1/2}$, $q = 2$, $T = T_{\sqrt{p-1}}$, and $C = 1$ in the following
proposition (and noting that $1/\sqrt{p' - 1} = \sqrt{p - 1}$):


Proposition 9.19. Let $T$ be a self-adjoint operator on $L^2(\Omega, \pi)$, let
$1 \le p, q \le \infty$, and let $p', q'$ be their conjugate Hölder indices. Assume
$\|Tf\|_q \le C\|f\|_p$ for all $f$. Then $\|Tg\|_{p'} \le C\|g\|_{q'}$ for all $g$.


Proof. This follows from

$$\|Tg\|_{p'} = \sup_{\|f\|_p = 1} \langle f, Tg \rangle = \sup_{\|f\|_p = 1} \langle Tf, g \rangle \le \sup_{\|f\|_p = 1} \|Tf\|_q \|g\|_{q'} \le C\|g\|_{q'},$$

where the first equality is the sharpness of Hölder’s inequality, the second
equality holds because $T$ is self-adjoint, the third inequality is Hölder’s, and
the final inequality uses the hypothesis $\|Tf\|_q \le C\|f\|_p$.


At this point we have established that if $x$ is a uniform $\pm 1$ bit, then it
is $(2, q, 1/\sqrt{q-1})$-hypercontractive and $(p, 2, \sqrt{p-1})$-hypercontractive. In
the next section we will give a very simple induction which transforms these
facts into the full (2, q)- and (p, 2)-Hypercontractivity Theorems stated at the
beginning of the chapter.

9.4. Two-Function Hypercontractivity and Induction


At this point we have established that if $f : \{-1, 1\} \to \mathbb{R}$ then for any
$p \le 2 \le q$,

$$\|T_{\sqrt{p-1}} f\|_2 \le \|f\|_p, \qquad \|T_{1/\sqrt{q-1}} f\|_q \le \|f\|_2.$$


We would like to extend these facts to the case of general $f : \{-1, 1\}^n \to \mathbb{R}$;
i.e., establish the (p, 2)- and (2, q)-Hypercontractivity Theorems stated at the
beginning of the chapter. A natural approach is induction.

In analysis of Boolean functions, there are two methods for proving
statements about $f : \{-1, 1\}^n \to \mathbb{R}$ by induction on $n$. One method, which
might be called “induction by derivatives”, uses the decomposition $f(x) =
x_n \mathrm{D}_n f(x) + \mathrm{E}_n f(x)$. We saw this approach in our inductive proof of the
Bonami Lemma. The other method, which might be called “induction by
restrictions”, goes via the subfunctions $f_{\pm 1}$ obtained by restricting the $n$th
coordinate of $f$ to $\pm 1$. We saw this approach in our proof of the OSSS Inequality
in Chapter 8.6. In both methods we reduce inductively from one function $f$
to two functions: either $\mathrm{D}_n f$ and $\mathrm{E}_n f$, or $f_{-1}$ and $f_{+1}$. Because of this, when
trying to prove a fact by induction on $n$ it’s often helpful to try proving a
generalized fact about two functions. Our proof of the OSSS Inequality gives a
good example of this technique.

So to facilitate induction, let’s find a two-function version of the hypercontractivity
statements we’ve proven so far. Perhaps the most natural statement
we’ve seen is the noise-stability rephrasing of the (4/3, 2)-Hypercontractivity
Theorem, namely $\mathbf{Stab}_{1/3}[f] \le \|f\|_{4/3}^2$. At least in the case $n = 1$, our work
in the previous section (Theorem 9.18) generalizes this to $\mathbf{Stab}_{p-1}[f] \le \|f\|_p^2$
for $1 \le p \le 2$. I.e.,

$$\mathbf{Stab}_\rho[f] = \mathop{\mathbf{E}}_{(x, y)\ \rho\text{-correlated}}[f(x) f(y)] \le \|f\|_{1+\rho}^2$$

for $0 \le \rho \le 1$. Looking at this, you might naturally guess a (correct) generalization
for two functions $f, g : \{-1, 1\}^n \to \mathbb{R}$, namely

$$\mathop{\mathbf{E}}_{(x, y)\ \rho\text{-correlated}}[f(x) g(y)] \le \|f\|_{1+\rho} \|g\|_{1+\rho}. \tag{9.12}$$


We have a nice interpretation of this inequality when $f, g : \{-1, 1\}^n \to \{0, 1\}$
are indicators of subsets $A, B \subseteq \{-1, 1\}^n$ as in Corollary 9.8; it gives an upper
bound on the probability of going from $A$ to $B$ in one step on the $\rho$-stable
hypercube graph. This bound is sharp when $A$ and $B$ have the same volume,
but for $A$ and $B$ of different sizes you might imagine it’s helpful to measure $f$
and $g$ by different norms in (9.12). To see what we can expect, let’s break up
the $\rho$-correlation in (9.12) into two parts; say, write

$$\rho = \sqrt{rs}, \qquad 0 \le r, s \le 1,$$

and use

$$\mathop{\mathbf{E}}_{(x, y)\ \rho\text{-correlated}}[f(x) g(y)] = \mathbf{E}[T_{\sqrt{r}} f \cdot T_{\sqrt{s}} g].$$

Then Cauchy–Schwarz implies

$$\mathop{\mathbf{E}}_{(x, y)\ \rho\text{-correlated}}[f(x) g(y)] = \mathbf{E}[T_{\sqrt{r}} f \cdot T_{\sqrt{s}} g] \le \|T_{\sqrt{r}} f\|_2 \|T_{\sqrt{s}} g\|_2 \le \|f\|_{1+r} \|g\|_{1+s}, \tag{9.13}$$

where the last step used (p, 2)-hypercontractivity, which we have so far
only proven in the case $n = 1$ (Theorem 9.18). The inequality (9.13), restated
below, is precisely the desired two-function version of the (2, q)- and (p, 2)-
Hypercontractivity Theorems.


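(Weak) Two-Function Hypercontractivity Theorem. Let $f, g : \{-1, 1\}^n \to \mathbb{R}$,
let $0 \le r, s \le 1$, and assume $0 \le \rho \le \sqrt{rs} \le 1$. Then

$$\mathop{\mathbf{E}}_{(x, y)\ \rho\text{-correlated}}[f(x) g(y)] \le \|f\|_{1+r} \|g\|_{1+s}.$$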
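
The statement can be spot-checked by brute force on a small cube. The following is a minimal sketch under stated assumptions: the helper names (`lp_norm`, `correlated_expectation`, `check_once`), the choice $\rho = \sqrt{rs}$, and the use of random real-valued test functions are illustrative choices for this example, not anything prescribed by the text.

```python
import itertools
import math
import random

def lp_norm(values, p):
    """L^p norm of a function on the uniform cube, given as a list of its values."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1 / p)

def correlated_expectation(f_vals, g_vals, n, rho):
    """E[f(x) g(y)] for x uniform on {-1,1}^n and y a rho-correlated copy of x."""
    pts = list(itertools.product((-1, 1), repeat=n))
    total = 0.0
    for fx, x in zip(f_vals, pts):
        for gy, y in zip(g_vals, pts):
            agree = sum(a == b for a, b in zip(x, y))
            # Each coordinate of y equals that of x with probability (1 + rho) / 2.
            weight = ((1 + rho) / 2) ** agree * ((1 - rho) / 2) ** (n - agree)
            total += fx * gy * weight
    return total / len(pts)

def check_once(n, r, s, rng):
    """Draw random f, g (an assumption of this sketch) and return (lhs, rhs)."""
    f_vals = [rng.uniform(-1, 1) for _ in range(2 ** n)]
    g_vals = [rng.uniform(-1, 1) for _ in range(2 ** n)]
    rho = math.sqrt(r * s)
    lhs = correlated_expectation(f_vals, g_vals, n, rho)
    rhs = lp_norm(f_vals, 1 + r) * lp_norm(g_vals, 1 + s)
    return lhs, rhs

rng = random.Random(0)
print(check_once(3, 0.5, 0.25, rng))
```

Random trials of this kind cannot prove the theorem, but a failure on even one draw would indicate a mistake in how the norms or the correlation were set up.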


We call this the “Weak” Two-Function Hypercontractivity Theorem because
the hypothesis $r, s \le 1$ is not actually necessary; see Chapter 10.1. As mentioned,
we have so far established this theorem in the case $n = 1$. However, the
beauty of hypercontractivity in this form is that it extends to general $n$ by an
almost trivial induction. The form of the induction is “induction by restrictions”.
(It’s also possible, but a little trickier, to extend the (2, q)-Hypercontractivity
Theorem from $n = 1$ to general $n$ via “induction by derivatives”; see Exercise 9.16.)
For future use, we will write the induction in more general notation.


Two-Function Hypercontractivity Induction Theorem. Let $0 \le \rho \le 1$ and
assume that

$$\mathop{\mathbf{E}}_{(x, y)\ \rho\text{-correlated}}[f(x) g(y)] \le \|f\|_p \|g\|_q$$

holds for every $f, g \in L^2(\Omega, \pi)$. Then the inequality also holds for every $f, g \in
L^2(\Omega^n, \pi^{\otimes n})$.

Proof. The proof is by induction on $n$, with the $n = 1$ case holding by assumption.
For $n > 1$, let $f, g \in L^2(\Omega^n, \pi^{\otimes n})$ and let $(x, y)$ denote a $\rho$-correlated pair
under $\pi^{\otimes n}$. We’ll use the notation $x = (x', x_n)$ where $x' = (x_1, \dots, x_{n-1})$, and
similar notation for $y$. Note that $(x', y')$ and $(x_n, y_n)$ are both $\rho$-correlated pairs
(of length $n - 1$ and 1, respectively). We’ll also write $f_{x_n} = f_{[n-1] \mid x_n}$ for the
restriction of $f$ in which the last coordinate is fixed to value $x_n$, and similarly
for $g$. Now

$$\mathop{\mathbf{E}}_{(x, y)}[f(x) g(y)] = \mathop{\mathbf{E}}_{(x_n, y_n)} \mathop{\mathbf{E}}_{(x', y')}[f_{x_n}(x') g_{y_n}(y')] \le \mathop{\mathbf{E}}_{(x_n, y_n)}\bigl[\|f_{x_n}\|_p \|g_{y_n}\|_q\bigr]$$

by induction. If we write $F \in L^2(\Omega, \pi)$ for the function $x_n \mapsto \|f_{x_n}\|_p$ and
similarly write $G(y_n) = \|g_{y_n}\|_q$, then we may continue the above as

$$\mathop{\mathbf{E}}_{(x_n, y_n)}\bigl[\|f_{x_n}\|_p \|g_{y_n}\|_q\bigr] = \mathop{\mathbf{E}}_{(x_n, y_n)}[F(x_n) G(y_n)] \le \|F\|_p \|G\|_q,$$

where we used the base case of the induction. Finally,

$$\|F\|_p = \mathop{\mathbf{E}}_{x_n}\bigl[|F(x_n)|^p\bigr]^{1/p} = \mathop{\mathbf{E}}_{x_n}\bigl[\|f_{x_n}\|_p^p\bigr]^{1/p} = \Bigl(\mathop{\mathbf{E}}_{x_n} \mathop{\mathbf{E}}_{x'}\bigl[|f_{x_n}(x')|^p\bigr]\Bigr)^{1/p} = \|f\|_p$$

by definition, and similarly for $\|G\|_q$. Thus we have established
$\mathbf{E}[f(x) g(y)] \le \|f\|_p \|g\|_q$, completing the induction.


Remark 9.20. More generally, if we assume the inequality holds over each of
$(\Omega_1, \pi_1), \dots, (\Omega_n, \pi_n)$, then it also holds over $(\Omega_1 \times \cdots \times \Omega_n, \pi_1 \otimes \cdots \otimes \pi_n)$;
the only change needed to the proof is notational.


At this point, we have fully established the Weak Two-Function Hypercontractivity
Theorem. By taking $g = f$ and $r = s = \rho$ in the theorem we
obtain the full (p, 2)-Hypercontractivity Theorem stated at the beginning of
the chapter. Finally, by applying Proposition 9.19 we also obtain the (2, q)-
Hypercontractivity Theorem for all $f : \{-1, 1\}^n \to \mathbb{R}$.


9.5. Applications of Hypercontractivity 


With the (2, q)- and (p, 2)-Hypercontractivity Theorems in hand, let’s revisit 
some applications we saw in Sections 9.1 and 9.2. We begin by deducing a 
generalization of the Bonami Lemma: 


Theorem 9.21. Let $f : \{-1, 1\}^n \to \mathbb{R}$ have degree at most $k$. Then
$\|f\|_q \le \sqrt{q-1}^{\,k}\, \|f\|_2$ for any $q \ge 2$.

Proof. We have

$$\|f\|_q^2 = \|T_{1/\sqrt{q-1}}\, T_{\sqrt{q-1}} f\|_q^2 \le \|T_{\sqrt{q-1}} f\|_2^2$$

using the (2, q)-Hypercontractivity Theorem. (Here we are extending the definition
of $T_\rho$ to $\rho > 1$ via $T_\rho f = \sum_j \rho^j f^{=j}$; see also Remark 8.29.) The result
now follows since

$$\|T_{\sqrt{q-1}} f\|_2^2 = \sum_{j=0}^{k} (q-1)^j\, \mathbf{W}^{j}[f] \le (q-1)^k \sum_{j=0}^{k} \mathbf{W}^{j}[f] = (q-1)^k \|f\|_2^2.$$
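
A brute-force check of the bound $\|f\|_q \le \sqrt{q-1}^{\,k}\|f\|_2$ on a small cube may help make the statement concrete. This is a sketch under assumptions: the helper names (`random_low_degree`, `evaluate`, `norm`), the random coefficients, and the parameters $n = 4$, $k = 2$ are all illustrative.

```python
import itertools
import math
import random

def random_low_degree(n, k, rng):
    """Random Fourier coefficients supported on sets S with |S| <= k (illustrative)."""
    return {S: rng.uniform(-1, 1)
            for d in range(k + 1)
            for S in itertools.combinations(range(n), d)}

def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial sum_S coeffs[S] * prod_{i in S} x_i."""
    return sum(c * math.prod(x[i] for i in S) for S, c in coeffs.items())

def norm(coeffs, n, q):
    """L^q norm of the polynomial under the uniform distribution on {-1,1}^n."""
    pts = itertools.product((-1, 1), repeat=n)
    return (sum(abs(evaluate(coeffs, x)) ** q for x in pts) / 2 ** n) ** (1 / q)

rng = random.Random(0)
n, k = 4, 2
f = random_low_degree(n, k, rng)
for q in (3, 4, 6):
    print(q, norm(f, n, q), math.sqrt(q - 1) ** k * norm(f, n, 2))
```

For each $q$ the first printed norm should not exceed the second quantity, as Theorem 9.21 asserts.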


Using a trick similar to the one in our proof of the (4/3, 2)-
Hypercontractivity Theorem you can use this to deduce $\|f\|_2 \le
(1/\sqrt{p-1})^k \|f\|_p$ when $f$ has degree $k$ for any $1 \le p \le 2$; see Exercise 9.14.
However, a different trick yields a strictly better result, including a finite bound
for $p = 1$:


Theorem 9.22. Let $f : \{-1, 1\}^n \to \mathbb{R}$ have degree at most $k$. Then $\|f\|_2 \le
e^k \|f\|_1$. More generally, for $1 \le p \le 2$ it holds that $\|f\|_2 \le (e^{2/p - 1})^k \|f\|_p$.


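Proof. We prove the statement about the 1-norm, leaving the case of general $1 <
p < 2$ to Exercise 9.15. For $\epsilon > 0$, let $0 < \theta < 1$ be the solution of $\frac{1}{2} = \frac{\theta}{1} + \frac{1-\theta}{2+\epsilon}$
(namely, $\theta = \frac{\epsilon}{2(1+\epsilon)}$). Applying the general version of Hölder’s inequality and
then Theorem 9.21, we get

$$\|f\|_2 \le \|f\|_1^{\theta}\, \|f\|_{2+\epsilon}^{1-\theta} \le \|f\|_1^{\theta}\, \sqrt{1+\epsilon}^{\,k(1-\theta)}\, \|f\|_2^{1-\theta}.$$

Dividing by $\|f\|_2^{1-\theta}$ (which we may assume is nonzero) and then raising the
result to the power of $1/\theta$ yields

$$\|f\|_2 \le \sqrt{1+\epsilon}^{\,k(1-\theta)/\theta}\, \|f\|_1 = \bigl((1+\epsilon)^{\frac{1}{\epsilon} + \frac{1}{2}}\bigr)^{k}\, \|f\|_1.$$

The result follows by taking the limit as $\epsilon \to 0$.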
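
For readers who want the arithmetic behind the choice of the interpolation parameter spelled out, the following worked computation (a routine expansion, using only the defining equation of $\theta$ from the proof) may be useful:

```latex
\begin{align*}
\frac{1}{2} = \theta + \frac{1-\theta}{2+\epsilon}
  &\iff \frac{2+\epsilon}{2} = \theta(2+\epsilon) + 1 - \theta = \theta(1+\epsilon) + 1
   \iff \theta = \frac{\epsilon}{2(1+\epsilon)}, \\
\frac{1-\theta}{\theta} &= \frac{2(1+\epsilon) - \epsilon}{\epsilon} = \frac{2+\epsilon}{\epsilon}, \\
\sqrt{1+\epsilon}^{\,k(1-\theta)/\theta}
  &= (1+\epsilon)^{\frac{k}{2}\cdot\frac{2+\epsilon}{\epsilon}}
   = \Bigl((1+\epsilon)^{\frac{1}{\epsilon}+\frac{1}{2}}\Bigr)^{k}
   \xrightarrow[\epsilon \to 0]{} e^{k}.
\end{align*}
```

The limit uses $(1+\epsilon)^{1/\epsilon} \to e$ and $(1+\epsilon)^{1/2} \to 1$.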


In the linear case of $k = 1$, Theorems 9.21 and 9.22 taken together show
that $c_p \|\sum_i a_i x_i\|_2 \le \|\sum_i a_i x_i\|_p \le C_p \|\sum_i a_i x_i\|_2$ for some constants $0 <
c_p < C_p$ depending only on $p \in [1, \infty)$. This fact is known as Khintchine’s
Inequality.

Theorem 9.21 can be used to get a strong concentration bound for degree-$k$
Boolean functions. Chernoff tells us that the probability a linear form $\sum_i a_i x_i$
exceeds $t$ standard deviations decays like $\exp(-\Theta(t^2))$. The following theorem
generalizes this to degree-$k$ forms, with decay $\exp(-\Theta(t^{2/k}))$:

Theorem 9.23. Let $f : \{-1, 1\}^n \to \mathbb{R}$ have degree at most $k$. Then for any
$t \ge \sqrt{2e}^{\,k}$ we have

$$\mathop{\mathbf{Pr}}_{x \sim \{-1,1\}^n}\bigl[|f(x)| \ge t \|f\|_2\bigr] \le \exp\bigl(-\tfrac{k}{2e}\, t^{2/k}\bigr).$$


Proof. We may assume $\|f\|_2 = 1$ without loss of generality. Let $q \ge 2$ be a
parameter to be chosen later. By Markov’s inequality,

$$\mathbf{Pr}[|f(x)| \ge t] = \mathbf{Pr}[|f(x)|^q \ge t^q] \le \frac{\mathbf{E}[|f(x)|^q]}{t^q}.$$

By Theorem 9.21 we have

$$\mathbf{E}[|f(x)|^q] \le \sqrt{q-1}^{\,kq}\, \|f\|_2^q = (q-1)^{(k/2)q} \le q^{(k/2)q}.$$

Thus $\mathbf{Pr}[|f(x)| \ge t] \le (q^{k/2}/t)^q$. It’s not hard to see that the $q$ that minimizes
this expression should be just slightly less than $t^{2/k}$. Specifically, by choosing
$q = t^{2/k}/e \ge 2$ we get

$$\mathbf{Pr}[|f(x)| \ge t] \le \exp(-(k/2) q) = \exp\bigl(-\tfrac{k}{2e}\, t^{2/k}\bigr)$$

as claimed.
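
To see the bound in action on one concrete function, the sketch below compares the exact tail of the degree-2 form $f(x) = \sum_{i<j} x_i x_j$ on $n = 10$ bits with the bound of Theorem 9.23. The function names `tail_bound` and `empirical_tail` and the choice of test function are assumptions made for this illustration.

```python
import itertools
import math

def tail_bound(t, k):
    """Right-hand side exp(-(k / 2e) * t^(2/k)) of Theorem 9.23, valid for t >= sqrt(2e)^k."""
    if t < math.sqrt(2 * math.e) ** k:
        raise ValueError("Theorem 9.23 requires t >= sqrt(2e)^k")
    return math.exp(-(k / (2 * math.e)) * t ** (2 / k))

def empirical_tail(n, t):
    """Exact Pr[|f(x)| >= t * ||f||_2] for f(x) = sum_{i<j} x_i x_j (degree 2)."""
    pts = list(itertools.product((-1, 1), repeat=n))
    # Closed form: sum_{i<j} x_i x_j = ((sum_i x_i)^2 - n) / 2.
    vals = [(sum(x) ** 2 - n) / 2 for x in pts]
    norm2 = math.sqrt(sum(v * v for v in vals) / len(vals))
    return sum(abs(v) >= t * norm2 for v in vals) / len(vals)

t = 5.5  # just above sqrt(2e)^2 = 2e, the smallest t allowed for k = 2
print(empirical_tail(10, t), tail_bound(t, 2))
```

On this small example the true tail is far below the bound, which is expected: the theorem is a worst-case statement over all degree-$k$ functions.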


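We can use Theorem 9.22 to get a “one-sided” analogue of Theorem 9.7,
showing that a low-degree function exceeds its mean with noticeable probability:


Theorem 9.24. Let $f : \{-1, 1\}^n \to \mathbb{R}$ be a nonconstant function of degree at
most $k$. Then

$$\mathop{\mathbf{Pr}}_{x}\bigl[f(x) > \mathbf{E}[f]\bigr] \ge \tfrac{1}{4} e^{-2k}.$$

Proof. We may assume $\mathbf{E}[f] = 0$ without loss of generality. We then have

$$\tfrac{1}{2}\|f\|_1 = \tfrac{1}{2}\bigl(\mathbf{E}[f \cdot 1_{\{f(x) > 0\}}] - \mathbf{E}[f \cdot (1 - 1_{\{f(x) > 0\}})]\bigr) = \mathbf{E}[f \cdot 1_{\{f(x) > 0\}}];$$

hence,

$$\tfrac{1}{4}\|f\|_1^2 = \mathbf{E}[f \cdot 1_{\{f(x) > 0\}}]^2 \le \mathbf{E}[f^2] \cdot \mathbf{E}[1_{\{f(x) > 0\}}^2] \le e^{2k} \|f\|_1^2 \cdot \mathbf{Pr}[f(x) > 0]$$

using Cauchy–Schwarz and Theorem 9.22. The result follows.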


Next we turn to noise stability. Using the (p, 2)-Hypercontractivity Theorem 
we can immediately deduce the following generalization of Corollary 9.8: 

Small-Set Expansion Theorem. Let $A \subseteq \{-1, 1\}^n$ have volume $\alpha$; i.e., let
$1_A : \{-1, 1\}^n \to \{0, 1\}$ satisfy $\mathbf{E}[1_A] = \alpha$. Then for any $0 \le \rho \le 1$,

$$\mathbf{Stab}_\rho[1_A] = \mathop{\mathbf{Pr}}_{\substack{x \sim \{-1,1\}^n \\ y \sim N_\rho(x)}}[x \in A,\ y \in A] \le \alpha^{2/(1+\rho)}.$$

Equivalently (for $\alpha > 0$),

$$\mathop{\mathbf{Pr}}_{\substack{x \sim A \\ y \sim N_\rho(x)}}[y \in A] \le \alpha^{(1-\rho)/(1+\rho)}.$$

In other words, the $\delta$-noisy hypercube is a small-set expander for any $\delta > 0$:
the probability that one step from a random $x \sim A$ stays inside $A$ is at most
$\alpha^{\delta/(1-\delta)}$. It’s also possible to derive a “two-set” generalization of this fact
using the Two-Function Hypercontractivity Theorem; we defer the discussion
to Chapter 10.1 since the most general result requires the non-weak form of the
theorem. We can also obtain the generalization of Corollary 9.12:


Corollary 9.25. Let $f : \{-1, 1\}^n \to \{-1, 1\}$. Then for any $0 \le \rho \le 1$ we have
$\mathbf{Inf}_i^{(\rho)}[f] \le \mathbf{Inf}_i[f]^{2/(1+\rho)}$ for all $i$.
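
The Small-Set Expansion Theorem stated above can be checked exhaustively for small sets on a small cube. The sketch below is illustrative only: the function name `stability`, the two test sets (a subcube and a Hamming ball), and the parameters $n = 6$, $\rho = 1/2$ are assumptions chosen for this example.

```python
import itertools

def stability(indicator, n, rho):
    """Pr[x in A, y in A] for x uniform on {-1,1}^n and y a rho-correlated copy of x."""
    pts = list(itertools.product((-1, 1), repeat=n))
    total = 0.0
    for x in pts:
        if not indicator(x):
            continue
        for y in pts:
            if indicator(y):
                agree = sum(a == b for a, b in zip(x, y))
                # Each coordinate of y agrees with x with probability (1 + rho) / 2.
                total += ((1 + rho) / 2) ** agree * ((1 - rho) / 2) ** (n - agree)
    return total / len(pts)

def volume(indicator, n):
    """Fraction of points of {-1,1}^n in the set."""
    pts = list(itertools.product((-1, 1), repeat=n))
    return sum(1 for x in pts if indicator(x)) / len(pts)

n, rho = 6, 0.5
subcube = lambda x: x[0] == 1 and x[1] == 1   # volume 1/4
ball = lambda x: sum(x) >= n - 2              # at most one coordinate equal to -1
for name, A in (("subcube", subcube), ("ball", ball)):
    alpha = volume(A, n)
    print(name, stability(A, n, rho), alpha ** (2 / (1 + rho)))
```

For the subcube the stability has the closed form $\alpha\,((1+\rho)/2)^2$, which gives a quick cross-check of the brute-force computation.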


Finally, from the Small-Set Expansion Theorem we see that indicators of 
small-volume sets are not very noise-stable and hence can’t have much of their 
Fourier weight at low levels. Indeed, using hypercontractivity we can deduce 
the Level-1 Inequality from Chapter 5.4 and also generalize it to higher degrees. 


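Level-k Inequalities. Let $f : \{-1, 1\}^n \to \{0, 1\}$ have mean $\mathbf{E}[f] = \alpha$ and let
$k \in \mathbb{N}^+$ be at most $2\ln(1/\alpha)$. Then

$$\mathbf{W}^{\le k}[f] \le \bigl(\tfrac{2e}{k} \ln(1/\alpha)\bigr)^k \alpha^2.$$

In particular, defining $k_\epsilon = 2(1 - \epsilon)\ln(1/\alpha)$ (for any $0 \le \epsilon \le 1$) we have

$$\mathbf{W}^{\le k_\epsilon}[f] \le \alpha^{\epsilon^2}.$$


Proof. By the Small-Set Expansion Theorem,

$$\mathbf{W}^{\le k}[f] \le \rho^{-k}\, \mathbf{Stab}_\rho[f] \le \rho^{-k} \alpha^{2/(1+\rho)} \le \rho^{-k} \alpha^{2(1-\rho)}$$

for any $0 < \rho \le 1$. Basic calculus shows the right-hand side is minimized when
$\rho = \frac{k}{2\ln(1/\alpha)} \le 1$; substituting this into $\rho^{-k} \alpha^{2(1-\rho)}$ yields the first claim. The
second claim follows after substituting $k = k_\epsilon$; see Exercise 9.19.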
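
The substitution in the proof of the first claim can be written out as follows; this is a routine expansion of the step described there, with $\rho = \frac{k}{2\ln(1/\alpha)}$:

```latex
\begin{align*}
\rho^{-k} &= \Bigl(\frac{2\ln(1/\alpha)}{k}\Bigr)^{k}, \\
\alpha^{2(1-\rho)} &= \alpha^{2} \cdot \alpha^{-2\rho}
   = \alpha^{2} \exp\bigl(2\rho \ln(1/\alpha)\bigr)
   = \alpha^{2} e^{k}, \\
\rho^{-k}\,\alpha^{2(1-\rho)} &= \Bigl(\frac{2e}{k}\ln(1/\alpha)\Bigr)^{k} \alpha^{2}.
\end{align*}
```

The step $\alpha^{2/(1+\rho)} \le \alpha^{2(1-\rho)}$ used just before it holds because $(1-\rho)(1+\rho) \le 1$ and $\alpha \le 1$.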


For the case $k = 1$, a slightly different argument gives the sharp Level-1
Inequality $\mathbf{W}^{1}[f] \le 2\alpha^2 \ln(1/\alpha)$; see Exercise 9.18.

9.6. Highlight: The Kahn–Kalai–Linial Theorem


Recalling the social choice setting of Chapter 2.1, consider a 2-candidate, 
$n$-voter election using a monotone voting rule $f : \{-1, 1\}^n \to \{-1, 1\}$. We
assume the impartial culture assumption (that the votes are independent and 
uniformly random), but with a twist: one of the candidates, say b € {—1, 1}, is 
able to secretly bribe k voters, fixing their votes to b. (Since f is monotone, this 
is always the optimal way for the candidate to fix the bribed votes.) How much 
can this influence the outcome of the election? This question was posed by Ben- 
Or and Linial in a 1985 work (Ben-Or and Linial, 1985, 1990); more precisely, 
they were interested in designing (unbiased) voting rules f that minimize the 
effect of any bribed k-coalition. 

Let’s first consider $k = 1$. If voter $i$ is bribed to vote for candidate $b$
(but all other votes remain uniformly random), this changes the bias of $f$
by $b\hat{f}(i) = b\,\mathbf{Inf}_i[f]$. Here we used the assumption that $f$ is monotone (i.e.,
Proposition 2.21). This led Ben-Or and Linial to the question of which unbiased
$f : \{-1, 1\}^n \to \{-1, 1\}$ has the least possible maximum influence:


Definition 9.26. Let $f : \{-1, 1\}^n \to \mathbb{R}$. The maximum influence of $f$ is
$\mathbf{MaxInf}[f] = \max\{\mathbf{Inf}_i[f] : i \in [n]\}$.


Ben-Or and Linial constructed the (nearly) unbiased $\mathrm{Tribes}_n : \{-1, 1\}^n \to
\{-1, 1\}$ function (from Chapter 4.2) and noted that it satisfies
$\mathbf{MaxInf}[\mathrm{Tribes}_n] = O\bigl(\frac{\log n}{n}\bigr)$. They further conjectured that every unbiased
function $f$ has $\mathbf{MaxInf}[f] = \Omega\bigl(\frac{\log n}{n}\bigr)$. This conjecture was famously proved
by Kahn, Kalai, and Linial (Kahn et al., 1988):


Kahn-Kalai-Linial (KKL) Theorem. For any $f : \{-1, 1\}^n \to \{-1, 1\}$,

$$\mathbf{MaxInf}[f] \ge \mathbf{Var}[f] \cdot \Omega\Bigl(\frac{\log n}{n}\Bigr).$$


Notice that the theorem says something sensible even for very biased func- 
tions f, i.e., those with low variance. The variance of f is indeed the right 
“scaling factor” since 


$$\tfrac{1}{n} \mathbf{Var}[f] \le \mathbf{MaxInf}[f] \le \mathbf{Var}[f]$$


holds trivially, by the Poincaré Inequality and Exercise 2.8. 
Before proving the KKL Theorem, let’s see an additional consequence for 
Ben-Or and Linial’s problem. 


Proposition 9.27. Let $f : \{-1, 1\}^n \to \{-1, 1\}$ be monotone and assume
$\mathbf{E}[f] \ge -.99$. Then there exists a subset $J \subseteq [n]$ with $|J| \le O(n/\log n)$
that if “bribed to vote 1” causes the outcome to be 1 almost surely; i.e.,

$$\mathbf{E}[f_{\overline{J} \mid (1, \dots, 1)}] \ge .99. \tag{9.14}$$

Similarly, if $\mathbf{E}[f] \le .99$ there exists $J \subseteq [n]$ with $|J| \le O(n/\log n)$ such that
$\mathbf{E}[f_{\overline{J} \mid (-1, \dots, -1)}] \le -.99$.


Proof. By symmetry it suffices to prove the result regarding bribery by candidate
$+1$. The candidate executes the following strategy: First, bribe the voter $i_1$
with the largest influence on $f_0 = f$; then bribe the voter $i_2$ with the largest
influence on $f_1 = f^{(i_1 \mapsto 1)}$; then bribe the voter $i_3$ with the largest influence
on $f_2 = f^{(i_1, i_2 \mapsto 1)}$, etc. For each $t \in \mathbb{N}$ we have

$$\mathbf{E}[f_{t+1}] \ge \mathbf{E}[f_t] + \mathbf{MaxInf}[f_t].$$

If after $t$ bribes the candidate has not yet achieved (9.14) we have
$-.99 \le \mathbf{E}[f_t] < .99$; thus $\mathbf{Var}[f_t] \ge \Omega(1)$ and the KKL Theorem implies that
$\mathbf{MaxInf}[f_t] \ge \Omega\bigl(\frac{\log n}{n}\bigr)$. Thus the candidate will achieve a bias of at least .99
after bribing at most $(.99 - (-.99))/\Omega\bigl(\frac{\log n}{n}\bigr) = O(n/\log n)$ voters.
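
The greedy strategy in this proof is easy to simulate for a small monotone rule. The following is a minimal sketch, assuming majority on 9 voters as the example rule; the function names (`majority`, `restricted_mean`, `greedy_bribe`) and the brute-force enumeration are choices made for this illustration only.

```python
import itertools

def majority(x):
    """Majority vote on an odd number of +/-1 votes (a monotone rule)."""
    return 1 if sum(x) > 0 else -1

def restricted_mean(f, n, bribed):
    """E[f] when the voters in `bribed` are fixed to +1 and the rest are uniform."""
    free = [i for i in range(n) if i not in bribed]
    total = 0
    for bits in itertools.product((-1, 1), repeat=len(free)):
        x = [1] * n
        for i, v in zip(free, bits):
            x[i] = v
        total += f(tuple(x))
    return total / 2 ** len(free)

def greedy_bribe(f, n, target=0.99):
    """Bribe voters one at a time until E[f] >= target; return the bribed voters in order.

    For monotone f, fixing voter i to +1 raises the mean by exactly that voter's
    influence on the current restriction, so maximizing the new mean is the same
    as picking the most influential remaining voter.
    """
    bribed, order = set(), []
    while restricted_mean(f, n, bribed) < target:
        best = max((i for i in range(n) if i not in bribed),
                   key=lambda i: restricted_mean(f, n, bribed | {i}))
        bribed.add(best)
        order.append(best)
    return order

print(greedy_bribe(majority, 9))
```

For majority on 9 voters, four bribes leave the mean at $.9375$, so a fifth bribe is needed before the target $.99$ is reached. The $O(n/\log n)$ guarantee of the proposition is asymptotic, so it says little about such a small $n$.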


Thus in any monotone election scheme, there is always a candidate b € 
{—1, 1} and a o(1)-fraction of the voters that b can bribe such that the election 
becomes 99%-biased in b’s favor. And if the election scheme was not terribly 
biased to begin with, then both candidates have this ability. For a more precise 
version of this result, see Exercise 9.27; for a nonmonotone version, see Exercise 9.28.
Note also that although the $\mathrm{Tribes}_n$ function is essentially optimal
for standing up to a single bribed voter, it is quite bad at standing up to bribed
coalitions: by bribing just a single tribe (DNF term), about $\log n$ voters, the
outcome can be completely forced to True. Nevertheless, Proposition 9.27 is
close to sharp: Ajtai and Linial (Ajtai and Linial, 1993) constructed an unbiased
monotone function $f : \{-1, 1\}^n \to \{-1, 1\}$ such that bribing any set of at most
$\epsilon n/\log^2 n$ voters changes the expectation by at most $O(\epsilon)$.

The remainder of this section is devoted to the proof of the KKL Theorem and
some variants. As mentioned earlier, the proof quickly follows from summing
Corollary 9.12 over all coordinates; but let’s give a more leisurely description.
We’ll focus on the main case of interest: showing that $\mathbf{MaxInf}[f] \ge \Omega\bigl(\frac{\log n}{n}\bigr)$
when $f$ is unbiased (i.e., $\mathbf{Var}[f] = 1$). If $f$’s total influence is at least, say,
$.1 \log n$, then even the average influence is $\Omega\bigl(\frac{\log n}{n}\bigr)$. So we may as well assume
$\mathbf{I}[f] \le .1 \log n$.

This leads us to the problem of characterizing (unbiased) functions with
small total influence. (This is the same issue that arose at the end of Chapter 8.4
when studying sharp thresholds.) It’s helpful to think about the case that the
total influence is very small, say $\mathbf{I}[f] \le K$ where $K = 10$ or $K = 100$, though
we eventually want to handle $K = .1 \log n$. Let’s think of $f$ as the indicator of a
volume-1/2 set $A \subseteq \{-1, 1\}^n$, so $\frac{\mathbf{I}[f]}{n}$ is the fraction of Hamming cube edges on
the boundary of $A$. The edge-isoperimetric inequality (or Poincaré Inequality)
tells us that $\mathbf{I}[f] \ge 1$: at least a $\frac{1}{n}$ fraction of the cube’s edges must be on $A$’s
boundary, with dictators and negated-dictators being the minimizers. Now
what can we say if $\mathbf{I}[f] \le K$; i.e., $A$’s boundary has only $K$ times more edges
than the minimum? Must $f$ be “somewhat similar” to a dictator or negated-dictator?
Kahn, Kalai, and Linial showed that the answer is yes: $f$ must have a
coordinate with influence at least $2^{-O(K)}$. This should be considered very large
(and dictator-like), since a priori all of the influences could have been equal
to $\frac{K}{n}$.


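KKL Edge-Isoperimetric Theorem. Let $f : \{-1, 1\}^n \to \{-1, 1\}$ be nonconstant
and let $\mathbf{I}'[f] = \mathbf{I}[f]/\mathbf{Var}[f] \ge 1$ (which is just $\mathbf{I}[f]$ if $f$ is unbiased).
Then

$$\mathbf{MaxInf}[f] \ge \Bigl(\frac{9}{\mathbf{I}'[f]^2}\Bigr) \cdot 9^{-\mathbf{I}'[f]}.$$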


This theorem is sharp for $\mathbf{I}'[f] = 1$ (cf. Exercises 1.19, 5.35), and it’s
nontrivial (in the unbiased case) for $\mathbf{I}[f]$ as large as $O(\log n)$. This last fact lets
us complete the proof of the KKL Theorem as originally stated: 


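Proof of the KKL Theorem from the Edge-Isoperimetric version. We may
assume $f$ is nonconstant. If $\mathbf{I}'[f] = \mathbf{I}[f]/\mathbf{Var}[f] \ge .1 \log n$, then we are
done: the total influence is at least $.1\,\mathbf{Var}[f] \cdot \log n$ and hence $\mathbf{MaxInf}[f] \ge
.1\,\mathbf{Var}[f] \cdot \frac{\log n}{n}$. Otherwise, the KKL Edge-Isoperimetric Theorem implies

$$\mathbf{MaxInf}[f] \ge \Omega\Bigl(\frac{1}{\log^2 n}\Bigr) \cdot 9^{-.1 \log n} = \widetilde{\Omega}(n^{-.1 \log 9}) = \widetilde{\Omega}(n^{-.317}) \ge \mathbf{Var}[f] \cdot \Omega\Bigl(\frac{\log n}{n}\Bigr).$$

(You are asked to be careful about the constant factors in Exercise 9.30.)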


We now turn to proving the KKL Edge-Isoperimetric Theorem. The high- 
level idea is to look at the contrapositive: supposing all of f’s influences are 
small, we want to show its total influence must be large. The assumption here is 
that each derivative D; f is a {—1, 0, 1}-valued function which is nonzero only 
on a “small” set. Hence “small-set expansion” implies that each derivative has 
“unusually large” noise sensitivity. (We are really just repeating Corollary 9.12 
in words here.) In turn this means that for each i € [n], the Fourier weight of f 
on coefficients containing i must be quite “high up”. Since this holds for all i
we deduce that all of f’s Fourier weight must be quite “high up” — hence f 
must have “large” total influence. We now make this story formal: 


Proof of the KKL Edge-Isoperimetric Theorem. We treat only the case that $f$
is unbiased, leaving the general case to Exercise 9.29 (see also the version
for product space domains in Chapter 10.3). The theorem is an immediate
consequence of the following chain of inequalities:

$$3 \cdot 3^{-\mathbf{I}[f]} \overset{(a)}{\le} 3\,\mathbf{Stab}_{1/3}[f] \overset{(b)}{\le} \mathbf{I}^{(1/3)}[f] \overset{(c)}{\le} \sum_{i=1}^{n} \mathbf{Inf}_i[f]^{3/2} \overset{(d)}{\le} \mathbf{MaxInf}[f]^{1/2} \cdot \mathbf{I}[f].$$

The key inequality is (c), which comes from summing Corollary 9.12
over all coordinates $i \in [n]$. Inequality (d) is immediate from $\mathbf{Inf}_i[f]^{3/2} \le
\mathbf{MaxInf}[f]^{1/2} \cdot \mathbf{Inf}_i[f]$. Inequality (b) is trivial from the Fourier formulas
(recall Fact 2.53):

$$\mathbf{I}^{(1/3)}[f] = \sum_{|S| \ge 1} |S| (1/3)^{|S| - 1} \hat{f}(S)^2 \ge 3 \sum_{|S| \ge 1} (1/3)^{|S|} \hat{f}(S)^2 = 3\,\mathbf{Stab}_{1/3}[f]$$

(the last equality using $\hat{f}(\emptyset) = 0$). Finally, inequality (a) is quickly proved
using the spectral sample: for $\boldsymbol{S} \sim \mathscr{S}_f$ we have

$$3\,\mathbf{Stab}_{1/3}[f] = 3 \sum_{S \subseteq [n]} (1/3)^{|S|} \hat{f}(S)^2 = 3\,\mathbf{E}[3^{-|\boldsymbol{S}|}] \ge 3 \cdot 3^{-\mathbf{E}[|\boldsymbol{S}|]} = 3 \cdot 3^{-\mathbf{I}[f]}, \tag{9.15}$$

the inequality following from convexity of $s \mapsto 3^{-s}$. We remark that it’s essentially
only this (9.15) that needs to be adjusted when $f$ is not unbiased.
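
For small $n$ the KKL Edge-Isoperimetric bound $\mathbf{MaxInf}[f] \ge (9/\mathbf{I}'[f]^2)\cdot 9^{-\mathbf{I}'[f]}$ can be evaluated directly. The sketch below computes influences by enumeration; the function names (`influences`, `kkl_edge_check`) and the example functions (majority on 5 bits, a dictator, a 3-bit AND-like function) are assumptions chosen for illustration.

```python
import itertools

def influences(f, n):
    """Inf_i[f] = Pr_x[f(x) != f(x with coordinate i flipped)] for each i."""
    pts = list(itertools.product((-1, 1), repeat=n))
    result = []
    for i in range(n):
        flips = sum(f(x) != f(x[:i] + (-x[i],) + x[i + 1:]) for x in pts)
        result.append(flips / len(pts))
    return result

def kkl_edge_check(f, n):
    """Return (MaxInf[f], (9 / I'^2) * 9^(-I')) with I' = I[f] / Var[f]; f must be nonconstant."""
    pts = list(itertools.product((-1, 1), repeat=n))
    mean = sum(f(x) for x in pts) / len(pts)
    variance = 1 - mean ** 2          # valid because f takes values in {-1, 1}
    infs = influences(f, n)
    ratio = sum(infs) / variance
    return max(infs), 9 / ratio ** 2 * 9 ** (-ratio)

majority5 = lambda x: 1 if sum(x) > 0 else -1
dictator = lambda x: x[0]
and3 = lambda x: 1 if all(v == 1 for v in x) else -1
for name, f, n in (("Maj_5", majority5, 5), ("dictator", dictator, 3), ("AND_3", and3, 3)):
    print(name, kkl_edge_check(f, n))
```

The dictator meets the bound with equality (up to rounding), in line with the remark that the theorem is sharp when $\mathbf{I}'[f] = 1$.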


We end this chapter by deriving an even stronger version of the KKL Edge- 
Isoperimetric Theorem, and deducing Friedgut’s Junta Theorem (mentioned 
at the end of Chapter 3.1) as a consequence. The KKL Edge-Isoperimetric 
Theorem tells us that if $f$ is unbiased and $\mathbf{I}[f] \le K$ then $f$ must look somewhat
like a 1-junta, in the sense of having a coordinate with influence at least $2^{-O(K)}$.
Friedgut’s Junta Theorem shows that in fact $f$ must essentially be a $2^{O(K)}$-junta.
To obtain this conclusion, you really just have to sum Corollary 9.12 only over 
the coordinates which have small influence on f. It’s also possible to get even 
stronger conclusions if f is known to have particularly good low-degree Fourier 
concentration. In aid of this, we’ll start by proving the following somewhat 
technical-looking result: 
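
Theorem 9.28. Let $f : \{-1, 1\}^n \to \{-1, 1\}$. Given $0 < \epsilon \le 1$ and $k \ge 0$,
define

$$\tau = \frac{\epsilon^2}{\mathbf{I}[f]^2}\, 9^{-k}, \qquad J = \{j \in [n] : \mathbf{Inf}_j[f] \ge \tau\}, \qquad \text{so } |J| \le \frac{\mathbf{I}[f]^3}{\epsilon^2}\, 9^{k}.$$

Then $f$’s Fourier spectrum is $\epsilon$-concentrated on

$$\mathcal{F} = \{S : S \subseteq J\} \cup \{S : |S| > k\}.$$

In particular, suppose $f$’s Fourier spectrum is also $\epsilon$-concentrated on degree
up to $k$. Then $f$’s Fourier spectrum is $2\epsilon$-concentrated on

$$\mathcal{F}' = \{S : S \subseteq J,\ |S| \le k\},$$

and $f$ is $\epsilon$-close to a $|J|$-junta $h : \{-1, 1\}^J \to \{-1, 1\}$.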


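Proof. Summing Corollary 9.12 just over $i \notin J$ we obtain

$$\sum_{i \notin J} \mathbf{Inf}_i^{(1/3)}[f] \le \sum_{i \notin J} \mathbf{Inf}_i[f]^{3/2} \le \max_{i \notin J}\{\mathbf{Inf}_i[f]^{1/2}\} \cdot \sum_{i \notin J} \mathbf{Inf}_i[f] \le \tau^{1/2} \cdot \mathbf{I}[f] \le 3^{-k} \epsilon,$$

where the last two inequalities used the definitions of $J$ and $\tau$, respectively. On
the other hand,

$$\sum_{i \notin J} \mathbf{Inf}_i^{(1/3)}[f] = \sum_{i \notin J} \sum_{S \ni i} (1/3)^{|S| - 1} \hat{f}(S)^2 = \sum_{S} |S \cap \overline{J}| \cdot 3^{1 - |S|} \hat{f}(S)^2$$
$$\ge \sum_{S \notin \mathcal{F}} |S \cap \overline{J}| \cdot 3^{1 - |S|} \hat{f}(S)^2 \ge 3^{-k} \sum_{S \notin \mathcal{F}} \hat{f}(S)^2.$$

Here the last inequality used that $S \notin \mathcal{F}$ implies $|S \cap \overline{J}| \ge 1$ and $3^{1 - |S|} \ge 3^{-k}$.
Combining these two deductions yields $\sum_{S \notin \mathcal{F}} \hat{f}(S)^2 \le \epsilon$, as claimed.

As for the second part of the theorem, when $f$’s Fourier spectrum is $2\epsilon$-concentrated
on $\mathcal{F}'$ it follows from Proposition 3.31 that $f$ is $2\epsilon$-close to the
Boolean-valued $|J|$-junta $\mathrm{sgn}(f^{\subseteq J})$. From Exercise 3.31 we may deduce that $f$
is in fact $\epsilon$-close to some $h : \{-1, 1\}^J \to \{-1, 1\}$.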


Remark 9.29. As you are asked to show in Exercise 9.31, by using Corollary 9.25
in place of Corollary 9.12, we can achieve junta size $(\mathbf{I}[f]^{2+\eta}/\epsilon^{1+\eta}) \cdot
C(\eta)^k$ in Theorem 9.28 for any $\eta > 0$, where $C(\eta) = (2/\eta + 1)$.


In Theorem 9.28 we may always take $k = \mathbf{I}[f]/\epsilon$, by the “Markov argument”
Proposition 3.2. Thus we obtain as a corollary:

Friedgut’s Junta Theorem. Let $f : \{-1, 1\}^n \to \{-1, 1\}$ and let $0 < \epsilon \le 1$.
Then $f$ is $\epsilon$-close to an $\exp(O(\mathbf{I}[f]/\epsilon))$-junta. Indeed, there is a set $J \subseteq [n]$
with $|J| \le \exp(O(\mathbf{I}[f]/\epsilon))$ such that $f$’s Fourier spectrum is $2\epsilon$-concentrated
on $\{S \subseteq J : |S| \le \mathbf{I}[f]/\epsilon\}$.


As mentioned, we can get stronger results for functions that are $\epsilon$-concentrated
up to degree much less than $\mathbf{I}[f]/\epsilon$. Width-$w$ DNFs, for example,
are $\epsilon$-concentrated on degree up to $O(w \log(1/\epsilon))$ (by Theorem 4.22). Thus:


Corollary 9.30. Any width-$w$ DNF is $\epsilon$-close to a $(1/\epsilon)^{O(w)}$-junta.


Uniformly noise-stable functions do even better. From Peres’s Theorem we
know that linear threshold functions are $\epsilon$-concentrated up to degree $O(1/\epsilon^2)$.
Thus Theorem 9.28 and Remark 9.29 imply:


Corollary 9.31. Let $f : \{-1, 1\}^n \to \{-1, 1\}$ be a linear threshold function
and let $0 < \epsilon, \eta \le 1/2$. Then $f$ is $\epsilon$-close to a junta on $\mathbf{I}[f]^{2+\eta} \cdot (1/\eta)^{O(1/\epsilon^2)}$
coordinates.


Assuming $\epsilon$ is a small universal constant we can take $\eta = 1/\log(O(\mathbf{I}[f]))$ and
deduce that every LTF is $\epsilon$-close to a junta on $\mathbf{I}[f]^2 \cdot \mathrm{polylog}(\mathbf{I}[f])$ coordinates.
This is essentially best possible since $\mathbf{I}[\mathrm{Maj}_n] = \Theta(\sqrt{n})$, but $\mathrm{Maj}_n$ is not even
.1-close to any $o(n)$-junta. By virtue of Theorem 5.37 on the uniform noise
stability of PTFs, we can also get this conclusion for any constant-degree PTF.

One more interesting fact we may derive is that every Boolean function has 
a Fourier coefficient that is at least inverse-exponential in the square of its total 
influence: 


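Corollary 9.32. Assume $f : \{-1, 1\}^n \to \{-1, 1\}$ satisfies $\mathbf{Var}[f] \ge 1/2$.
Then there exists $S \subseteq [n]$ with $0 < |S| \le O(\mathbf{I}[f])$ such that $\hat{f}(S)^2 \ge
\exp(-O(\mathbf{I}[f]^2))$.


Proof. Taking $\epsilon = 1/8$ in Friedgut’s Junta Theorem we get a $J$ with
$|J| \le \exp(O(\mathbf{I}[f]))$ such that $f$ has Fourier weight at least $1 - 2\epsilon = 3/4$
on $\mathcal{F} = \{S \subseteq J : |S| \le 8\,\mathbf{I}[f]\}$. Since $\hat{f}(\emptyset)^2 = 1 - \mathbf{Var}[f] \le 1/2$ we conclude
that $f$ has Fourier weight at least 1/4 on $\mathcal{F}' = \mathcal{F} \setminus \{\emptyset\}$. But $|\mathcal{F}'| \le |J|^{8\mathbf{I}[f]} =
\exp(O(\mathbf{I}[f]^2))$, so the result follows by the Pigeonhole Principle.
(Here we used that $(1/4)\exp(-O(\mathbf{I}[f]^2)) = \exp(-O(\mathbf{I}[f]^2))$ because
$\mathbf{I}[f] \ge \mathbf{Var}[f] \ge \frac{1}{2}$.)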


Remark 9.33. Of course, if $\mathbf{Var}[f] \le 1/2$, then $f$ has a large empty Fourier
coefficient: $\hat{f}(\emptyset)^2 \ge 1/2$. For a more refined version of Corollary 9.32, see
Exercise 9.32.

It is an open question whether Corollary 9.32 can be improved to give a
Fourier coefficient satisfying $\hat{f}(S)^2 \ge \exp(-O(\mathbf{I}[f]))$; see Exercise 9.33 and
the discussion of the Fourier Entropy–Influence Conjecture in Exercise 10.23.


9.7. Exercises and Notes 


9.1 For every $1 < b < B$ show that there is a $b$-reasonable random variable $X$
such that $1 + X$ is not $B$-reasonable.

9.2 For $k = 1$, improve the 9 in the Bonami Lemma to 3. More precisely, suppose
$f : \{-1, 1\}^n \to \mathbb{R}$ has degree at most 1 and that $x_1, \dots, x_n$ are independent
3-reasonable random variables satisfying $\mathbf{E}[x_i] = \mathbf{E}[x_i^3] = 0$.
(For example, the $x_i$’s may be uniform $\pm 1$ bits.) Show that $f(x)$ is also
3-reasonable. (Hint: By direct computation, or by running through the
Bonami Lemma proof with $k = 1$ more carefully.)


9.3 Let $k$ be a positive multiple of 3 and let $n \ge 2k$ be an integer. Define
$f : \{-1, 1\}^n \to \mathbb{R}$ by

$$f(x) = \sum_{\substack{S \subseteq [n] \\ |S| = k}} x^S.$$

(a) Show that

$$\mathbf{E}[f^4] \ge \frac{\binom{n}{k/3,\ k/3,\ k/3,\ k/3,\ k/3,\ k/3,\ n-2k}}{\binom{n}{k}^2}\, \mathbf{E}[f^2]^2,$$

where the numerator of the fraction is a multinomial coefficient;
specifically, the number of ways of choosing six disjoint size-$k/3$
subsets of $[n]$. (Hint: Given such size-$k/3$ subsets, consider quadruples
of size-$k$ subsets that hit each size-$k/3$ subset twice.)

(b) Using Stirling’s Formula, show that

$$\lim_{n \to \infty} \frac{\binom{n}{k/3,\ k/3,\ k/3,\ k/3,\ k/3,\ k/3,\ n-2k}}{\binom{n}{k}^2} = \Theta(k^{-2} 9^k).$$

Deduce the following lower bound for the Bonami Lemma: $\|f\|_4 \ge
\Omega(k^{-1/2}) \cdot \sqrt{3}^{\,k}\, \|f\|_2$. (In fact, $\|f\|_4 = \Theta(k^{-1/4}) \cdot \sqrt{3}^{\,k}\, \|f\|_2$ and such
an upper bound holds for all $f$ homogeneous of degree $k$; see Exercise
and 9.38(f).)

9.4 Prove Corollary 9.6. 

9.5 Let $0 \le \delta \le \frac{1}{1600}$ and let $f, g$ be real numbers satisfying $|g^2 - 1| \ge 39\sqrt{\delta}$
and $|f| = 1$. Show that $|f - g|^2 \ge 169\delta$. (This is a loose estimate;
stronger ones are possible.)

9.6 Theorem 9.21 shows that the (2, 4)-Hypercontractivity Theorem implies
the Bonami Lemma. In this exercise you will show the reverse implication.

(a) Let $f : \{-1, 1\}^n \to \mathbb{R}$. For a fixed $\delta \in (0, 1)$, use the Bonami Lemma
to show that

$$\|T_{(1-\delta)/\sqrt{3}} f\|_4 \le \sum_{k=0}^{\infty} (1 - \delta)^k \|f^{=k}\|_2 \le \frac{1}{\delta} \|f\|_2.$$

(b) For $g : \{-1, 1\}^n \to \mathbb{R}$ and $d \in \mathbb{N}^+$, let $g^{\oplus d} : \{-1, 1\}^{dn} \to \mathbb{R}$ be the
function defined by $g^{\oplus d}(x^{(1)}, \dots, x^{(d)}) = g(x^{(1)}) g(x^{(2)}) \cdots g(x^{(d)})$
(where each $x^{(i)} \in \{-1, 1\}^n$). Show that $\|T_\rho(g^{\oplus d})\|_p = \|T_\rho g\|_p^d$ holds
for every $p \in \mathbb{R}^+$ and $\rho \in [-1, 1]$. Note the special case $\rho = 1$.

(c) Deduce from parts (a) and (b) that in fact $\|T_{(1-\delta)/\sqrt{3}} f\|_4 \le \|f\|_2$.
(Hint: Apply part (a) to $f^{\oplus d}$ for larger and larger $d$.)

(d) Deduce that in fact $\|T_{1/\sqrt{3}} f\|_4 \le \|f\|_2$; i.e., the (2, 4)-
Hypercontractivity Theorem follows from the Bonami Lemma. (Hint:
Take the limit as $\delta \to 0^+$.)


9.7 Suppose we wish to show that $\|\mathrm{T}_\rho f\|_q \leq \|f\|_p$ for all $f : \{-1,1\}^n \to \mathbb{R}$.
Show that it suffices to show this for all nonnegative f. (Hint: Exercise 2.34.)

9.8 Fix $k \in \mathbb{N}$. The goal of this exercise is to show that “projection to degree k
is a bounded operator in all $L^p$ norms, p > 1”. Let $f : \{-1,1\}^n \to \mathbb{R}$.
(a) Let $q \geq 2$. Show that $\|f^{\leq k}\|_q \leq \sqrt{q-1}^{\,k}\, \|f\|_q$. (Hint: Use Theorem 9.21
to show the stronger statement $\|f^{\leq k}\|_q \leq \sqrt{q-1}^{\,k}\, \|f\|_2$.)
(b) Let $1 < q \leq 2$. Show that $\|f^{\leq k}\|_q \leq (1/\sqrt{q-1})^k\, \|f\|_q$. (Hint: Either
give a similar direct proof using the (p, 2)-Hypercontractivity Theorem,
or explain how this follows from part (a) using the dual norm
Proposition 9.19.)
9.9 Let X be $(p, q, \rho)$-hypercontractive.
(a) Show that cX is $(p, q, \rho)$-hypercontractive for any $c \in \mathbb{R}$.
(b) Show that $\rho \leq \frac{\|X\|_p}{\|X\|_q}$.

9.10 Let X be $(p, q, \rho)$-hypercontractive. (For simplicity you may want to
assume X is a discrete random variable.)
(a) Show that $\mathbf{E}[X]$ must be 0. (Hint: Taylor expand $\|1 + \rho\epsilon X\|_r$ to one
term around $\epsilon = 0$; note that $\rho < 1$ by definition.)
(b) Show that $\rho \leq \sqrt{\frac{p-1}{q-1}}$. (Hint: Taylor expand $\|1 + \rho\epsilon X\|_r$ to two terms
around $\epsilon = 0$.)

9.11 (a) Suppose $\mathbf{E}[X] = 0$. Show that X is $(q, q, 0)$-hypercontractive for all
$q \geq 1$. (Hint: Use monotonicity of norms to reduce to the case q = 1.)
(b) Show further that X is $(q, q, \rho)$-hypercontractive for all $0 \leq \rho < 1$.
(Hint: Write $(a + \rho X) = (1 - \rho)a + \rho(a + X)$ and employ the triangle
inequality for $\|\cdot\|_q$.)
(c) Show that if X is $(p, q, \rho)$-hypercontractive, then it is also $(p, q, \rho')$-
hypercontractive for all $0 \leq \rho' < \rho$. (Hint: Use the previous exercise
along with Exercise 9.10(a).)
9.12 Let X be a (nonconstant) $(2, 4, \rho)$-hypercontractive random variable. The
goal of this exercise is to show the following anticoncentration result: For
all $\theta \in \mathbb{R}$ and $0 \leq t \leq 1$,

$$\mathbf{Pr}\bigl[|X - \theta| > t\,\|X\|_2\bigr] \geq (1 - t^2)^2 \rho^4.$$

(a) Reduce to the case $\|X\|_2 = 1$.

(b) Letting $Y = (X - \theta)^2$, show that $\mathbf{E}[Y] = 1 + \theta^2$ and
$\mathbf{E}[Y^2] \leq (\rho^{-2} + \theta^2)^2$.

(c) Using the Paley–Zygmund inequality, show that

$$\mathbf{Pr}\bigl[|X - \theta| > t\bigr] \geq \frac{\bigl(\rho^2(1 - t^2) + \rho^2\theta^2\bigr)^2}{(1 + \rho^2\theta^2)^2}.$$

(d) Show that the right-hand side above is minimized for $\theta = 0$, thereby
completing the proof.
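As an informal sanity check on the anticoncentration statement (not a substitute for the proof), one can test it by brute force on linear forms $X = \sum_i a_i x_i$ of uniform ±1 bits, which are $(2, 4, 1/\sqrt{3})$-hypercontractive by the (2, 4)-Hypercontractivity Theorem. The function names, the choice n = 6, and the sampling ranges below are illustrative assumptions, not part of the text.

```python
import itertools
import random

def anticoncentration_gap(coeffs, theta, t, rho=3 ** -0.5):
    """Return Pr[|X - theta| > t*||X||_2] - (1 - t^2)^2 * rho^4
    for X = sum_i coeffs[i] * x_i with independent uniform +-1 bits x_i."""
    n = len(coeffs)
    l2 = sum(c * c for c in coeffs) ** 0.5   # ||X||_2 for independent +-1 bits
    hits = 0
    for x in itertools.product([-1, 1], repeat=n):
        value = sum(c * xi for c, xi in zip(coeffs, x))
        if abs(value - theta) > t * l2:
            hits += 1
    return hits / 2 ** n - (1 - t * t) ** 2 * rho ** 4

def smallest_gap(trials=300, n=6, seed=2):
    """Smallest gap seen over random coefficients, thresholds theta and t."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        coeffs = [rng.uniform(-1, 1) for _ in range(n)]
        theta = rng.uniform(-3, 3)
        t = rng.uniform(0, 1)
        worst = min(worst, anticoncentration_gap(coeffs, theta, t))
    return worst
```

If the exercise’s inequality holds, `smallest_gap()` should never be negative; a negative value would indicate a bug in the experiment rather than a counterexample.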


9.13 Let $m \in \mathbb{N}^+$ and let $f : \{-1,1\}^n \to [m]$ be “unbiased”, meaning
$\mathbf{Pr}[f(x) = i] = \frac{1}{m}$ for all $i \in [m]$. Let $0 \leq \rho \leq 1$ and let (x, y) be a
ρ-correlated pair. Show that $\mathbf{Pr}[f(x) = f(y)] \leq (1/m)^{(1-\rho)/(1+\rho)}$. (More
generally, you might show that this is an upper bound on $\mathbf{Stab}_\rho[f]$ for all
$f : \{-1,1\}^n \to \triangle_m$ with $\mathbf{E}[f] = (\frac{1}{m}, \dots, \frac{1}{m})$; see Exercise 8.33.)

9.14 (a) Let $f : \{-1,1\}^n \to \mathbb{R}$ have degree at most k. Prove that
$\|f\|_2 \leq (1/\sqrt{p-1})^k\, \|f\|_p$ for any $1 \leq p \leq 2$ using the Hölder inequality
strategy from our proof of the (4/3, 2)-Hypercontractivity Theorem,
together with Theorem 9.21.

(b) Verify that $\exp(\frac{2}{p} - 1) < 1/\sqrt{p-1}$ for all $1 \leq p < 2$; i.e., the trickier
Theorem 9.22 strictly improves on the bound from part (a).


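9.15 Prove Theorem 9.22 in full generality. (Hint: Let $\theta$ be the solution of
$\frac{1}{2} = \frac{\theta}{p} + \frac{1-\theta}{2+\epsilon}$. You will need to show that $\frac{1-\theta}{2\theta} = \bigl(\frac{2}{p} - 1\bigr)\bigl(\frac{1}{\epsilon} + \frac{1}{2}\bigr)$.)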
9.16 As mentioned, it’s possible to deduce the (2, q)-Hypercontractivity Theorem
from the n = 1 case using induction by derivatives. From this one can
also obtain the (p, 2)-Hypercontractivity Theorem via Proposition 9.19.
Employing the notation $x = (x', x_n)$, $\mathrm{T} = \mathrm{T}_{1/\sqrt{q-1}}$, $d = \mathrm{D}_n f(x')$, and
$e = \mathrm{E}_n f(x')$, fill in details and justifications for the following proof
sketch:

$$\begin{aligned}
\|\mathrm{T}_{1/\sqrt{q-1}} f\|_q^2 &= \mathbf{E}_{x'}\Bigl[\mathbf{E}_{x_n}\bigl[\,|\mathrm{T}e + (1/\sqrt{q-1})\, x_n \mathrm{T}d|^q\,\bigr]\Bigr]^{2/q} \\
&\leq \mathbf{E}_{x'}\bigl[\bigl((\mathrm{T}e)^2 + (\mathrm{T}d)^2\bigr)^{q/2}\bigr]^{2/q} \\
&= \|(\mathrm{T}e)^2 + (\mathrm{T}d)^2\|_{q/2} \leq \|(\mathrm{T}e)^2\|_{q/2} + \|(\mathrm{T}d)^2\|_{q/2} \\
&= \|\mathrm{T}e\|_q^2 + \|\mathrm{T}d\|_q^2 \leq \|e\|_2^2 + \|d\|_2^2 = \|f\|_2^2.
\end{aligned}$$

9.17 Deduce the $p \leq 2 \leq q$ cases of the Hypercontractivity Theorem from the
(2, q)- and (p, 2)-Hypercontractivity Theorems. (Hint: Use the semigroup
property of $\mathrm{T}_\rho$, Exercise 2.32.)

9.18 Let $f : \{-1,1\}^n \to \{0, 1\}$ have $\mathbf{E}[f] = \alpha$.

(a) Show that $\mathbf{W}^1[f] \leq \frac{1}{\rho}\bigl(\alpha^{2/(1+\rho)} - \alpha^2\bigr)$ for any $0 < \rho \leq 1$.

(b) Deduce the sharp Level-1 Inequality $\mathbf{W}^1[f] \leq 2\alpha^2 \ln(1/\alpha)$. (Hint:
Take the limit $\rho \to 0^+$.)

9.19 In this exercise you will prove the second statement of the Level-k
Inequalities.

(a) Show that choosing $k = k_\epsilon$ in the theorem yields

$$\mathbf{W}^{\leq k_\epsilon}[f] \leq \alpha^{2\epsilon - (2 - 2\epsilon)\ln(1/(1-\epsilon))}.$$

(b) Show that $2\epsilon - (2 - 2\epsilon)\ln(1/(1-\epsilon)) \geq \epsilon^2$ for all $0 \leq \epsilon \leq 1$.

9.20 Show that the KKL Theorem fails for functions $f : \{-1,1\}^n \to [-1, 1]$,
even under the assumption $\mathbf{Var}[f] \geq \Omega(1)$. (Hint:
$f(x) = \mathrm{trunc}_{[-1,1]}\bigl(\frac{x_1 + \cdots + x_n}{\sqrt{n}}\bigr)$.)

9.21 (a) Show that $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathbf{I}[f] \leq O(\sqrt{\log n})\}$ is
learnable from queries to any constant error $\epsilon > 0$ in time poly(n).
(Hint: Theorem 9.28.)

(b) Show that $\mathcal{C} = \{\text{monotone } f : \{-1,1\}^n \to \{-1,1\} \mid \mathbf{I}[f] \leq O(\sqrt{\log n})\}$
is learnable from random examples to any constant error
$\epsilon > 0$ in time poly(n).

(c) Show that $\mathcal{C} = \{\text{monotone } f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{DT}_{\mathrm{size}}(f) \leq \mathrm{poly}(n)\}$
is learnable from random examples to any constant error
$\epsilon > 0$ in time poly(n). (Hint: Exercise 8.43 and the OS Inequality.)
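9.22 Deduce the following generalization of the (2, q)-Hypercontractivity
Theorem: Let $f : \{-1,1\}^n \to \mathbb{R}$, $q \geq 2$, and assume $0 \leq \rho \leq 1$ satisfies
$\rho^\lambda \leq 1/\sqrt{q-1}$ for some $0 \leq \lambda \leq 1$. Then

$$\|\mathrm{T}_\rho f\|_q \leq \|\mathrm{T}_\rho f\|_2^{1-\lambda}\, \|f\|_2^{\lambda}.$$

(Hint: Show $\|\mathrm{T}_\rho f\|_q^2 \leq \sum_S \bigl(\rho^{2|S|} \widehat{f}(S)^2\bigr)^{1-\lambda} \cdot \bigl(\widehat{f}(S)^2\bigr)^{\lambda}$ and use Hölder.)

9.23 Let $f : \{-1,1\}^n \to [-1, 1]$, let $0 \leq \epsilon \leq 1$, and assume $q \geq 2 + 2\epsilon$.
Show that

$$\|\mathrm{T}_{1-\epsilon} f\|_q^q \leq \|\mathrm{T}_{1/\sqrt{1+2\epsilon}} f\|_q^q \leq \bigl(\|f\|_2^2\bigr)^{1+\epsilon}.$$

9.24 Recall the Gaussian quadrant probability $\Lambda_\rho(\mu)$ defined in Exercise 5.32
by $\Lambda_\rho(\mu) = \mathbf{Pr}[z_1 > t, z_2 > t]$, where $z_1, z_2$ are standard Gaussians with
correlation $\mathbf{E}[z_1 z_2] = \rho$ and t is defined by $\overline{\Phi}(t) = \mu$. The goal of this
exercise is to show that for fixed $0 < \rho < 1$ we have the estimate

$$\Lambda_\rho(\mu) = \widetilde{\Theta}\bigl(\mu^{2/(1+\rho)}\bigr) \qquad (9.16)$$

as $\mu \to 0$. In light of Exercise 5.32, this will show that the Small-Set
Expansion Theorem for the ρ-stable hypercube graph is essentially sharp
due to the example of Hamming balls of volume $\mu$.

(a) First let’s do an imprecise “heuristic” calculation. We have
$\mathbf{Pr}[z_1 > t] = \mathbf{Pr}[z_2 > t] = \mu$ by definition. Conditioned on a Gaussian being
at least t it is unlikely to be much more than t, so let’s just pretend
that $z_1 = t$. Then the conditional distribution of $z_2$ is $\rho t + \sqrt{1 - \rho^2}\, y$,
where $y \sim \mathrm{N}(0, 1)$ is an independent Gaussian. Using the fact
that $\overline{\Phi}(u) \sim \varphi(u)/u$ as $u \to \infty$, deduce that
$\mathbf{Pr}[z_2 > t \mid z_1 = t] = \widetilde{\Theta}\bigl(\mu^{(1-\rho)/(1+\rho)}\bigr)$ and “hence” (9.16) holds.

(b) Let’s now be rigorous. Recall that we are treating $0 < \rho < 1$ as fixed
and letting $\mu \to 0$ (hence $t \to \infty$). Let $\varphi_\rho(z_1, z_2)$ denote the joint pdf
of $z_1, z_2$ so that

$$\Lambda_\rho(\mu) = \int_t^\infty\!\!\int_t^\infty \varphi_\rho(z_1, z_2)\, dz_1\, dz_2.$$

Derive the following similar-looking integral:

$$\int_t^\infty\!\!\int_t^\infty (z_1 - \rho t)(z_2 - \rho z_1)\, \varphi_\rho(z_1, z_2)\, dz_1\, dz_2 = \frac{(1-\rho^2)^{3/2}}{2\pi} \exp\Bigl(-\frac{2}{1+\rho} \cdot \frac{t^2}{2}\Bigr) \qquad (9.17)$$

and show that the right-hand side is $\widetilde{\Theta}\bigl(\mu^{2/(1+\rho)}\bigr)$.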


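(c) Show that


2 a

Pr [zı > =] = | Pade = Ou”) = out).

(d) Deduce (9.16). (Hint: Try to arrange that the extraneous factors
$(z_2 - \rho z_1)$, $(z_1 - \rho t)$ in (9.17) are both at least 1.)

9.25 Let $f : \{-1,1\}^n \to \{-1,1\}$, let $J \subseteq [n]$, and write $\overline{J} = [n] \setminus J$. Define
the coalitional influence of J on f to be

$$\widetilde{\mathbf{Inf}}_J[f] = \mathbf{Pr}_{z \sim \{-1,1\}^{\overline{J}}}\bigl[f_{J \mid z} \text{ is not constant}\bigr].$$

Furthermore, for $b \in \{-1, +1\}$ define the coalitional influence toward b
of J on f to be

$$\widetilde{\mathbf{Inf}}{}^{\,b}_J[f] = \mathbf{Pr}_{z \sim \{-1,1\}^{\overline{J}}}\bigl[f_{J \mid z} \text{ can be made } b\bigr] - \mathbf{Pr}[f = b]
= \mathbf{Pr}_{z \sim \{-1,1\}^{\overline{J}}}\bigl[f_{J \mid z} \not\equiv -b\bigr] - \mathbf{Pr}[f = b].$$

For brevity, we’ll sometimes write $\widetilde{\mathbf{Inf}}{}^{\,\pm}_J[f]$ rather than $\widetilde{\mathbf{Inf}}{}^{\,\pm 1}_J[f]$.

(a) Show that for coalitions of size 1 we have
$\mathbf{Inf}_i[f] = \widetilde{\mathbf{Inf}}_{\{i\}}[f] = 2\,\widetilde{\mathbf{Inf}}{}^{\,\pm}_{\{i\}}[f]$.

(b) Show that $0 \leq \widetilde{\mathbf{Inf}}_J[f] \leq 1$.

(c) Show that $\widetilde{\mathbf{Inf}}_J[f] = \widetilde{\mathbf{Inf}}{}^{\,+}_J[f] + \widetilde{\mathbf{Inf}}{}^{\,-}_J[f]$.

(d) Show that if f is monotone, then

$$\widetilde{\mathbf{Inf}}{}^{\,b}_J[f] = \mathbf{Pr}\bigl[f_{\overline{J} \mid (b, \dots, b)} = b\bigr] - \mathbf{Pr}[f = b].$$

(e) Show that $\widetilde{\mathbf{Inf}}_J[\chi_{[n]}] = 1$ for all $J \neq \emptyset$.

(f) Supposing we write $t = |J|/\sqrt{n}$, show that
$\widetilde{\mathbf{Inf}}{}^{\,\pm}_J[\mathrm{Maj}_n] = \Phi(t) - \frac{1}{2} \pm o(1)$ and hence $\widetilde{\mathbf{Inf}}_J[\mathrm{Maj}_n] = 2\Phi(t) - 1 \pm o(1)$. Thus
$\widetilde{\mathbf{Inf}}_J[\mathrm{Maj}_n] = o(1)$ if $|J| = o(\sqrt{n})$ and $\widetilde{\mathbf{Inf}}_J[\mathrm{Maj}_n] = 1 - o(1)$ if
$|J| = \omega(\sqrt{n})$. (Hint: Central Limit Theorem.)

(g) Show that $\max\{\widetilde{\mathbf{Inf}}{}^{\,\mathrm{True}}_J[\mathrm{Tribes}_n] : |J| \leq \log n\} = 1/2 + \Theta\bigl(\frac{\log n}{n}\bigr)$. On
the other hand, show that $\max\{\widetilde{\mathbf{Inf}}{}^{\,\mathrm{False}}_J[\mathrm{Tribes}_n] : |J| \leq k\} \leq k \cdot O\bigl(\frac{\log n}{n}\bigr)$.
Deduce that for some positive constant c we have
$\max\{\widetilde{\mathbf{Inf}}_J[\mathrm{Tribes}_n] : |J| \leq cn/\log n\} \leq .51$. (Hint: Refer to
Proposition 4.12.)

9.26 Show that the exponential dependence on $\mathbf{I}[f]$ in Friedgut’s Junta Theorem
is necessary. (Hint: Exercise 4.15.)

9.27 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a monotone function with $\mathbf{Var}[f] \geq \delta > 0$,
and let $0 < \epsilon < 1/2$ be given.
(a) Improve Proposition 9.27 as follows: Show that there exists $J \subseteq [n]$
with $|J| \leq O\bigl(\log \frac{1}{\epsilon\delta}\bigr) \cdot \frac{n}{\log n}$ such that $\mathbf{E}[f_{\overline{J} \mid (1, \dots, 1)}] \geq 1 - \epsilon$. (Hint:
How many bribes are required to move f’s mean outside the interval
$[1 - 2\eta, 1 - \eta]$?)
(b) Show that there exists $J \subseteq [n]$ with $|J| \leq O\bigl(\log \frac{1}{\epsilon\delta}\bigr) \cdot \frac{n}{\log n}$ such that
$\widetilde{\mathbf{Inf}}_J[f] \geq 1 - \epsilon$. (Hint: Use Exercise 9.25(d) and take the union of
two influential sets.)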
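9.28 Let $f : \{-1,1\}^n \to \{-1,1\}$.
(a) Let $f^* : \{-1,1\}^n \to \{-1,1\}$ be the “monotonization” of f as defined
in Exercise 2.52. Show that $\widetilde{\mathbf{Inf}}{}^{\,b}_J[f^*] \leq \widetilde{\mathbf{Inf}}{}^{\,b}_J[f]$ for all $J \subseteq [n]$ and
$b \in \{-1, 1\}$, and hence also $\widetilde{\mathbf{Inf}}_J[f^*] \leq \widetilde{\mathbf{Inf}}_J[f]$.
(b) Let $\mathbf{Var}[f] \geq \delta > 0$ and let $0 < \epsilon < 1/2$ be given. Show that there
exists $J \subseteq [n]$ with $|J| \leq O\bigl(\log \frac{1}{\epsilon\delta}\bigr) \cdot \frac{n}{\log n}$ such that $\widetilde{\mathbf{Inf}}_J[f] \geq 1 - \epsilon$.
(Hint: Combine part (a) with Exercise 9.27(b).)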
9.29 Establish the general-variance case of the KKL Edge-Isoperimetric
Theorem. (Hint: You’ll need to replace (9.15) with

$$3 \sum_{|S| \geq 1} (1/3)^{|S|}\, \widehat{f}(S)^2 \geq 3\, \mathbf{Var}[f] \cdot 3^{-\mathbf{I}[f]/\mathbf{Var}[f]}.$$

Use the same convexity argument, but applied to the random variable S
that takes on each outcome $\emptyset \neq S \subseteq [n]$ with probability
$\widehat{f}(S)^2 / \mathbf{Var}[f]$.)

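9.30 The goal of this exercise is to attain the best known constant factor in the
statement of the KKL Theorem.

(a) By using Corollary 9.25 in place of Corollary 9.12, obtain the following
generalization of the KKL Edge-Isoperimetric Theorem: For any
(nonconstant) $f : \{-1,1\}^n \to \{-1,1\}$ and $0 < \delta < 1$,

$$\mathbf{MaxInf}[f] \geq \Bigl(\frac{1+\delta}{1-\delta}\Bigr)^{1/\delta} \Bigl(\frac{1}{\mathbf{I}'[f]}\Bigr)^{1/\delta} \Bigl(\frac{1-\delta}{1+\delta}\Bigr)^{\mathbf{I}'[f]/\delta},$$

where $\mathbf{I}'[f]$ denotes $\mathbf{I}[f]/\mathbf{Var}[f]$. (Hint: Write $\rho = \frac{1-\delta}{1+\delta}$.) Deduce
that for any constant $C > e^2$ we have
$\mathbf{MaxInf}[f] \geq \widetilde{\Omega}\bigl(C^{-\mathbf{I}'[f]}\bigr)$.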

(b) More carefully, show that by taking ô = TT T srz We can achieve 


p NA 
MaxľĪnf[ f] > exp(—2I'[ f]) - e (ny ) 


A -exp(—4I'[f1'””). 


(Hint: Establish $\bigl(\frac{1-\delta}{1+\delta}\bigr)^{1/\delta} \geq \exp(-2 - \delta^2)$ for $0 < \delta \leq 1/2$.)


(c) By distinguishing whether or not $\mathbf{I}'[f] \geq \frac{1}{2}(\ln n - \sqrt{\log n})$, establish
the following form of the KKL Theorem: For any $f : \{-1,1\}^n \to \{-1,1\}$,

$$\mathbf{MaxInf}[f] \geq \frac{1}{2}\, \mathbf{Var}[f] \cdot \frac{\ln n}{n}\, (1 - o_n(1)).$$

9.31 Establish the claim in Remark 9.29.

9.32 Show that if $f : \{-1,1\}^n \to \{-1,1\}$ is nonconstant, then there
exists $S \subseteq [n]$ with $0 < |S| \leq O(\mathbf{I}[f]/\mathbf{Var}[f])$ such that
$\widehat{f}(S)^2 \geq \exp(-O(\mathbf{I}[f]^2/\mathbf{Var}[f]^2))$. (Hint: By mimicking Corollary 9.32’s proof
you should be able to establish the lower bound
$\Omega(\mathbf{Var}[f]) \cdot \exp(-O(\mathbf{I}[f]^2/\mathbf{Var}[f]^2))$. To show that this quantity is also
$\exp(-O(\mathbf{I}[f]^2/\mathbf{Var}[f]^2))$, use Theorem 2.39.)

9.33 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a nonconstant monotone function.
Improve on Corollary 9.32 by showing that there exists $S \neq \emptyset$ with
$\widehat{f}(S)^2 \geq \exp(-O(\mathbf{I}[f]/\mathbf{Var}[f]))$. (Hint: You can even get $|S| \leq 1$; use
the KKL Edge-Isoperimetric Theorem and Proposition 2.21.)

9.34 Let $f : \{-1,1\}^n \to \mathbb{R}$. Prove that $\|f\|_4 \leq \mathrm{sparsity}(\widehat{f})^{1/4}\, \|f\|_2$.

9.35 Let q = 2r be a positive even integer, let $\rho = 1/\sqrt{q-1}$, and
let $f_1, \dots, f_r : \{-1,1\}^n \to \mathbb{R}$. Generalize the (2, q)-Hypercontractivity
Theorem by showing that

$$\mathbf{E}\Bigl[\prod_{i=1}^r (\mathrm{T}_\rho f_i)^2\Bigr] \leq \prod_{i=1}^r \mathbf{E}[f_i^2].$$

(Hint: Hölder’s inequality.)

9.36 In this exercise you will give a simpler, stronger version of Theorem 9.17
under the assumption that q = 2r is a positive even integer.

(a) Using the idea of Proposition 9.16, show that if x is a uniformly
random ±1 bit then x is $(2, q, \rho)$-hypercontractive if and only if
$\rho \leq 1/\sqrt{q-1}$.

(b) Show the same statement for any random variable x satisfying
$\mathbf{E}[x^2] = 1$ and

$$\mathbf{E}[x^{2j-1}] = 0, \qquad \mathbf{E}[x^{2j}] \leq (2r-1)^j\, \frac{\binom{r}{j}}{\binom{2r}{2j}} \qquad \text{for all integers } 1 \leq j \leq r.$$

(c) Show that none of the even moment conditions in part (b) can be
relaxed.
9.37 Let q = 2r be a positive even integer and let $f : \{-1,1\}^n \to \mathbb{R}$ be
homogeneous of degree $k \geq 1$ (i.e., $f = f^{=k}$). The goal of this problem is to
improve slightly on the generalized Bonami Lemma, Theorem 9.21.

(a) Show that

$$\mathbf{E}[f^q] = \sum \widehat{f}(S_1) \cdots \widehat{f}(S_q) \leq \sum |\widehat{f}(S_1)| \cdots |\widehat{f}(S_q)|, \qquad (9.18)$$

where the sum is over all tuples $S_1, \dots, S_q$ with $S_1 \triangle \cdots \triangle S_q = \emptyset$.

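(b) Let G denote the complete q-partite graph over vertex sets $V_1, \dots, V_q$,
each of cardinality k. Let $\mathcal{M}$ denote the set of all perfect matchings
in G. Show that the right-hand side of (9.18) is equal to

$$\frac{1}{(k!)^q} \sum_{M \in \mathcal{M}}\ \sum_{\ell : M \to [n]} |\widehat{f}(T_1(M, \ell))| \cdots |\widehat{f}(T_q(M, \ell))|, \qquad (9.19)$$

where $T_j(M, \ell)$ denotes $\bigcup\{\ell(e) : e \in M,\ e \cap V_j \neq \emptyset\}$.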
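(c) Show that (9.19) is equal to

$$\frac{1}{(rk)!\,(k!)^q} \sum_{M \in \overline{\mathcal{M}}}\ \sum_{i_1=1}^n \sum_{i_2=1}^n \cdots \sum_{i_{rk}=1}^n |\widehat{f}(U_1(M, i_1, \dots, i_{rk}))| \times \cdots \times |\widehat{f}(U_q(M, i_1, \dots, i_{rk}))|, \qquad (9.20)$$

where $\overline{\mathcal{M}}$ is the set of ordered perfect matchings of G, and now
$U_j(M, i_1, \dots, i_{rk})$ denotes $\bigcup\{i_t : M(t) \cap V_j \neq \emptyset\}$.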
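(d) Show that for any $M \in \overline{\mathcal{M}}$ we have

$$\sum_{i_1=1}^n \sum_{i_2=1}^n \cdots \sum_{i_{rk}=1}^n |\widehat{f}(U_1(M, i_1, \dots, i_{rk}))| \cdots |\widehat{f}(U_q(M, i_1, \dots, i_{rk}))| \leq (k!)^r\, \|f\|_2^{2r}.$$

(Hint: Use Cauchy–Schwarz rk times.)
(e) Deduce that $\|f\|_q^q \leq \frac{1}{(rk)!\,(k!)^q} \cdot |\overline{\mathcal{M}}| \cdot (k!)^r\, \|f\|_2^{q} = |\mathcal{M}| \cdot \frac{1}{(k!)^{r}}\, \|f\|_2^q$ and hence

$$\|f\|_q \leq \frac{|\mathcal{M}|^{1/q}}{\sqrt{k!}}\, \|f\|_2.$$

9.38 The goal of this problem is to estimate $|\mathcal{M}|$ from Exercise 9.37 so as to
give a concrete improvement on Theorem 9.21.
(a) Show that for q = 4, k = 2 we have $|\mathcal{M}| = 60$.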
(b) Show that $|\mathcal{M}| \leq (qk - 1)!!$. (Hint: Show that $(qk - 1)!!$ is the number
of perfect matchings in the complete graph on qk vertices.)
Deduce $\|f\|_q \leq \sqrt{q}^{\,k}\, \|f\|_2$.

(c) Show that $|\mathcal{M}| \leq \bigl(\frac{q-1}{r}\bigr)^{rk} (rk)!$, and thereby deduce

$$\|f\|_q \leq C_{q,k} \cdot \sqrt{q-1}^{\,k}\, \|f\|_2,$$

where $C_{q,k}^q = \frac{(rk)!}{(k!)^r\, r^{rk}}$. (Hint: Suppose that the first t edges of the
perfect matching have been chosen; show that there are at most
$\frac{q-1}{r}(rk - t)^2$ choices for the next edge. The worst case is if the
vertices used up so far are spread equally among the q parts.)

(d) Give a simple proof that $C_{q,k}^q \leq 1$, thereby obtaining Theorem 9.21.

(e) Show that in fact $C_{q,k}^q = \Theta_q(1) \cdot k^{-q/4 + 1/2}$. (Hint: Stirling’s Formula.)

(f) Can you obtain the improved estimate

$$\frac{|\mathcal{M}|^{1/q}}{\sqrt{k!}} = \Theta_q(1) \cdot k^{-1/4} \cdot \sqrt{q-1}^{\,k}\ ?$$

(Hint: First exactly count, then estimate, the number of perfect
matchings with exactly $e_{ij}$ edges between parts i and j. Then sum
your estimate over a range of the most likely values for $e_{ij}$.)
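For small parameters the quantity $|\mathcal{M}|$ can be found by direct enumeration, which is a convenient way to check an answer to part (a). The following is a minimal brute-force sketch; the function name and the vertex numbering (vertex v lies in part v // k) are assumptions made here for illustration, and the recursion is only practical for very small q and k.

```python
def count_multipartite_perfect_matchings(q, k):
    """Count perfect matchings of the complete q-partite graph whose
    q parts each contain k vertices (edges join different parts only)."""
    part_of = [v // k for v in range(q * k)]   # vertex v lies in part v // k

    def count(remaining):
        if not remaining:
            return 1
        v, rest = remaining[0], remaining[1:]
        total = 0
        # Match the smallest remaining vertex with any vertex of another part.
        for idx, w in enumerate(rest):
            if part_of[w] != part_of[v]:
                total += count(rest[:idx] + rest[idx + 1:])
        return total

    return count(tuple(range(q * k)))
```

For q = 4 and k = 2 the count agrees with part (a), and dividing by $(k!)^{q/2} = 4$ gives the constant 15 mentioned in the Notes.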


Notes 


The history of the Hypercontractivity Theorem is complicated. Its earliest roots are
in the work of Paley (Paley, 1932) from 1932; he showed that for $1 < p < \infty$ there
are constants $0 < c_p < C_p < \infty$ such that $c_p \|Sf\|_p \leq \|f\|_p \leq C_p \|Sf\|_p$ holds for
any $f : \{-1,1\}^n \to \mathbb{R}$. Here $Sf = \sqrt{\sum_{t=1}^n (\mathrm{d}_t f)^2}$ is the “square function” of f,
and $\mathrm{d}_t f = \sum_{S : \max(S) = t} \widehat{f}(S)\, x^S$ is the martingale difference sequence for f defined in
Exercise 8.17. The main task in Paley’s work is to prove the statement when p is an even
integer; other values of p follow by the Riesz(–Thorin) interpolation theorem. Using
this result, Paley showed the following hypercontractivity result: If $f : \{-1,1\}^n \to \mathbb{R}$
is homogeneous of degree 2, then $c'_p \|f\|_2 \leq \|f\|_p \leq C'_p \|f\|_2$ for any $p \in \mathbb{R}^+$.

In 1968 Bonami (Bonami, 1968) stated the following variant of Theorem 9.21: If
$f : \{-1,1\}^n \to \mathbb{R}$ is homogeneous of degree k, then for all $q \geq 2$, $\|f\|_q \leq c_k \sqrt{q}^{\,k}\, \|f\|_2$,
where the constant $c_k$ may be taken to be 1 if q is an even integer. She remarks that
this theorem can be deduced from Paley’s result but with a much worse (exponential)
dependence on q. The proof she gives is combinatorial and actually only treats the case
k = 2 and q an even integer; it is similar to Exercise 9.37.

Independently in 1969, Kiener (Kiener, 1969) published his Ph.D. thesis, which
extended Paley’s hypercontractivity result as follows: If $f : \{-1,1\}^n \to \mathbb{R}$ is
homogeneous of degree k, then $c_{p,k} \|f\|_2 \leq \|f\|_p \leq C_{p,k} \|f\|_2$ for any $p \in \mathbb{R}^+$. The proof is an
induction on k, and again the bulk of the work is the case of even integer p. Kiener also
gave a long combinatorial proof showing that if $f : \{-1,1\}^n \to \mathbb{R}$ is homogeneous of
degree 2, then $\mathbf{E}[f^4] \leq 51\, \mathbf{E}[f^2]^2$. (Exercise 9.38(a) improves this 51 to 15.)

Also independently in 1969, Schreiber (Schreiber, 1969) considered multilinear
polynomials f over a general orthonormal sequence $x_1, \dots, x_n$ of centered real (or
complex) random variables. He showed that if f has degree at most k, then for any
even integer $q \geq 4$ it holds that $\|f\|_q \leq C \|f\|_2$, where C depends only on k, q, and the
q-norms of the $x_i$’s. Again, the proof is very similar to Exercise 9.37; Schreiber does not
estimate his analogue of $|\mathcal{M}|$ but merely notes that it’s finite. Schreiber was interested
mainly in the case that the $x_i$’s are Gaussian; indeed, his 1969 work (Schreiber, 1969)
is a generalization of his earlier work (Schreiber, 1967) specific to the Gaussian case.

In 1970, Bonami published her Ph.D. thesis (Bonami, 1970), which contains the 
full Hypercontractivity Theorem as stated at the beginning of the chapter. Her proof 
follows the standard template seen in essentially all proofs of hypercontractivity: first 
an elementary proof for the case n = 1 and then an induction to extend to general n. 
She also gives the sharper combinatorial result appearing in Exercises 9.37 and 9.38(c). 
(The stronger bound from Exercise 9.38(f) is due to Janson (Janson, 1997, Remark 
5.20).) As in Corollary 9.6, Bonami notes that her combinatorial proof can be extended 
to a general sequence of symmetric orthonormal random variables, at the expense of 
including factors of $\|x_i\|_q$ into the bound. She points out that this includes the Gaussian
case independently studied by Schreiber. 

Bonami’s work was published in French, and it remained unknown to most
English-language mathematicians for about a decade. In the late 1960s and early 1970s,
researchers in quantum field theory developed the theory of hypercontractivity for
the Gaussian analogue of $\mathrm{T}_\rho$, namely, the Ornstein–Uhlenbeck operator $\mathrm{U}_\rho$. This is now
recognized as essentially being a special case of hypercontractivity for bits, in light of
the fact that $\frac{x_1 + \cdots + x_n}{\sqrt{n}}$ tends to a Gaussian as $n \to \infty$ by the CLT (see Chapter 11.1). We
summarize here some of the work in this setting. In 1966 Nelson (Nelson, 1966) showed
that $\|\mathrm{U}_{1/\sqrt{q-1}} f\|_q \leq C_q \|f\|_2$ for all $q \geq 2$. Glimm (Glimm, 1968) gave the alternative
result that for each $q \geq 2$ there is a sufficiently small $\rho_q > 0$ such that $\|\mathrm{U}_{\rho_q} f\|_q \leq \|f\|_2$.
Segal (Segal, 1970) observed that hypercontractive results can be proved by induction
on the dimension n. In 1973 Nelson (Nelson, 1973) gave the full Hypercontractivity
Theorem in the Gaussian setting: $\|\mathrm{U}_{\sqrt{(p-1)/(q-1)}} f\|_q \leq \|f\|_p$ for all $1 \leq p \leq q \leq \infty$.
He also proved the combinatorial Exercise 9.37. The equivalence to the Two-Function
Hypercontractivity Theorem is from the work of Neveu (Neveu, 1976).

In 1975 Gross (Gross, 1975) introduced the notion of Log-Sobolev Inequalities 
(see Exercise 10.23) and showed how to deduce hypercontractivity inequalities from 
them. He established the Log-Sobolev Inequality for 1-bit functions, used induction 
(citing Segal) to obtain it for n-bit functions, and then used the CLT to transfer results 
to the Gaussian setting. (For some earlier results along these lines, see the works 
of Federbush and Gross (Federbush, 1969; Gross, 1972).) This gave a new proof of 
Nelson’s result and also independently established Bonami’s full Hypercontractivity 
Theorem. Also in 1975, Beckner (Beckner, 1975) published his Ph.D. thesis, which
proved a sharp form of the hypercontractive inequality for purely complex $\rho$. (It is
unfortunate that the influential paper of Kahn, Kalai, and Linial (Kahn et al., 1988)
miscredited the Hypercontractivity Theorem to Beckner.) The case of general complex $\rho$
was subsequently treated by Weissler (Weissler, 1979), with the sharp result being 
obtained by Epperson (Epperson, 1989). Weissler (Weissler, 1980) also appears to 
have been the first to make the connection between this line of work and Bonami’s 
thesis. 
Independently of all this work, the (q, 2)-Hypercontractivity Theorem was reproved
(without sharp constant) in the Banach spaces community by Rosenthal (Rosenthal,
1976) in 1975, using methods similar to those of Paley and Kiener. For additional early
references, see Müller (Müller, 2005, Chapter 1).

The term “hypercontractivity” was introduced in Simon and Høegh-Krohn (Simon
and Høegh-Krohn, 1972); Definition 9.13 of a hypercontractive random variable is
due to Krakowiak and Szulga (Krakowiak and Szulga, 1988). The short inductive 
proof of the Bonami Lemma may have appeared first in Mossel, O’Donnell, and 
Oleszkiewicz (Mossel et al., 2005a). Theorems 9.22 and 9.24 appear in Janson (Janson, 
1997). Theorem 9.23 dates back to Pisier and Zinn and to Borell (Pisier and Zinn, 
1978; Borell, 1979). The Small-Set Expansion Theorem is due to Kahn, Kalai, and 
Linial (Kahn et al., 1988); the Level-k Inequalities appear in several places but can 
probably be fairly credited to Kahn, Kalai, and Linial (Kahn et al., 1988) as well. The 
optimal constants for Khintchine’s Inequality were established by Haagerup (Haagerup,
1982); see also Nazarov and Podkorytov (Nazarov and Podkorytov, 2000). They always
occur either when $\sum_i a_i x_i$ is just $\frac{1}{\sqrt{2}} x_1 + \frac{1}{\sqrt{2}} x_2$ or in the limiting Gaussian case of
$a_i \equiv \frac{1}{\sqrt{n}}$, $n \to \infty$.

Ben-Or and Linial’s work (Ben-Or and Linial, 1985, 1990) was motivated both
by game theory and by the Byzantine Generals problem (Lamport et al., 1982) from 
distributed computing; the content of Exercise 9.25 is theirs. In turn it motivated the 
watershed paper by Kahn, Kalai, and Linial (Kahn et al., 1988). (See also the interme- 
diate work of Chor and Geréb-Graus (Chor and Geréb-Graus, 1987).) The “KKL Edge- 
Isoperimetric Theorem” (which is essentially a strengthening of the basic KKL Theo- 
rem) was first explicitly proved by Talagrand (Talagrand, 1994) (possibly independently 
of Kahn, Kalai, and Linial (Kahn et al., 1988)?); he also treated the p-biased case. There 
is no known combinatorial proof of the KKL Theorem (i.e., one which does not involve 
real-valued functions). However, several slightly different analytic proofs are known; 
see Falik and Samorodnitsky (Falik and Samorodnitsky, 2007), Rossignol (Rossignol, 
2006), and O’Donnell and Wimmer (O’Donnell and Wimmer, 2013). The explicit lower
bound on the “KKL constant” achieved in Exercise 9.30 is the best known; it appeared 
first in Falik and Samorodnitsky (Falik and Samorodnitsky, 2007). It is still a factor of 2 
away from the best known upper bound, achieved by the tribes function. 

Friedgut’s Junta Theorem dates from 1998 (Friedgut, 1998). The observation that its
junta size can be improved for functions which have $\mathbf{W}^{\geq k}[f] \leq \epsilon$ for $k \ll \mathbf{I}[f]/\epsilon$ was
independently made by Li-Yang Tan in 2011; so was the consequence Corollary 9.31 and
its extension to constant-degree PTFs. A stronger result than Corollary 9.31 is known:
Diakonikolas and Servedio (Diakonikolas and Servedio, 2009) showed that every LTF
is $\epsilon$-close to an $\mathbf{I}[f]^2 \cdot \mathrm{poly}(1/\epsilon)$-junta. As for Corollary 9.30, it’s incomparable with a
result from Gopalan, Meka, and Reingold (Gopalan et al., 2012), which shows that
every width-w DNF is $\epsilon$-close to a $(w \log(1/\epsilon))^{O(w)}$-junta.

Exercise 9.3 was suggested to the author by Krzysztof Oleszkiewicz. Exercise 9.12 
is from Gopalan et al. (Gopalan et al., 2010). Exercise 9.21 appears in O’Donnell 
and Servedio (O’Donnell and Servedio, 2007); Exercise 9.22 appears in O’Donnell
and Wu (O’Donnell and Wu, 2009). The estimate in Exercise 9.24 is from de Klerk, 
Pasechnik, and Warners (de Klerk et al., 2004) (see also Rinott and Rotar’ (Rinott and 
Rotar’, 2001) and Khot et al. (Khot et al., 2007)). Exercises 9.27 and 9.28 are due to 
Kahn, Kalai, and Linial (Kahn et al., 1988). Exercise 9.34 was suggested to the author 
by John Wright. Exercise 9.36 appears in Kauers et al. (Kauers et al., 2013). 


10 
Advanced Hypercontractivity 


In this chapter we complete the proof of the Hypercontractivity Theorem for
uniform ±1 bits. We then generalize the (p, 2) and (2, q) statements to the
setting of arbitrary product probability spaces, proving the following:


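The General Hypercontractivity Theorem. Let $(\Omega_1, \pi_1), \dots, (\Omega_n, \pi_n)$ be
finite probability spaces, in each of which every outcome has probability at
least $\lambda$. Let $f \in L^2(\Omega_1 \times \cdots \times \Omega_n, \pi_1 \otimes \cdots \otimes \pi_n)$. Then for any $q > 2$ and
$0 \leq \rho \leq \frac{1}{\sqrt{q-1}} \cdot \lambda^{1/2 - 1/q}$,

$$\|\mathrm{T}_\rho f\|_q \leq \|f\|_2 \quad\text{and}\quad \|\mathrm{T}_\rho f\|_2 \leq \|f\|_{q'}.$$

(And in fact, the upper bound on $\rho$ can be slightly relaxed to the value stated
in Theorem 10.18.)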


We can thereby extend all the consequences of the basic Hypercontractivity
Theorem for $f : \{-1,1\}^n \to \mathbb{R}$ to functions $f \in L^2(\Omega^n, \pi^{\otimes n})$, except with
quantitatively worse parameters depending on “$\lambda$”. We also introduce the
technique of randomization/symmetrization and show how it can sometimes
eliminate this dependence on $\lambda$. For example, it’s used to prove Bourgain’s Sharp
Threshold Theorem, a characterization of Boolean-valued $f \in L^2(\Omega^n, \pi^{\otimes n})$
with low total influence that has no dependence at all on $\pi$.


10.1. The Hypercontractivity Theorem for Uniform ±1 Bits


In this section we’ll prove the full Hypercontractivity Theorem for uniform ±1
bits stated at the beginning of Chapter 9:

The Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}$ and let
$1 \leq p \leq q \leq \infty$. Then $\|\mathrm{T}_\rho f\|_q \leq \|f\|_p$ for $0 \leq \rho \leq \sqrt{\frac{p-1}{q-1}}$.

Actually, when neither p nor q is 2, the following equivalent form of the theorem
seems easier to interpret:


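Two-Function Hypercontractivity Theorem. Let $f, g : \{-1,1\}^n \to \mathbb{R}$, let
$r, s \geq 0$, and assume $0 \leq \rho \leq \sqrt{rs} \leq 1$. Then

$$\mathop{\mathbf{E}}_{\substack{(x, y) \\ \rho\text{-correlated}}}[f(x)\, g(y)] \leq \|f\|_{1+r}\, \|g\|_{1+s}.$$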


As a reminder, the only difference between this theorem and its “weak” form 
(proven in Chapter 9.4) is that we don’t assume r, s < 1. Below we will show 
that the two theorems are equivalent, via Hölder’s inequality. Given the Two- 
Function Hypercontractivity Induction Theorem from Chapter 9.4, this implies 
that to prove the Hypercontractivity Theorem for general n we only need to 
prove it for n = 1. This is an elementary but technical inequality, which we 
defer to the end of the section. 

Before carrying out these proofs, let’s take some time to interpret the
Two-Function Hypercontractivity Theorem. One interpretation is simply as
a generalization of Hölder’s inequality. Consider the case that the strings x
and y in the theorem are fully correlated; i.e., $\rho = 1$. Then the theorem states
that

$$\mathbf{E}[f(x)\, g(x)] \leq \|f\|_{1+r}\, \|g\|_{1+1/r} \qquad (10.1)$$

because the condition $\sqrt{rs} = 1$ is equivalent to $s = 1/r$. This statement is
identical to Hölder’s inequality, since $(1 + r)' = 1 + 1/r$. Hölder’s inequality
is often used to “break the correlation” between two random variables; in the
absence of any information about how f and g correlate, we can at least
bound $\mathbf{E}[f(x)\, g(x)]$ by the product of certain norms of f and g. (If f and g have
different “sizes”, then Hölder lets us choose different norms for them; if f and g
have roughly the same “size”, then we can take r = s = 1 and get Cauchy–Schwarz.)
Now suppose we are considering $\mathbf{E}[f(x)\, g(y)]$ for ρ-correlated x, y
with $\rho < 1$. In this case we might hope to improve (10.1) by using smaller
norms on the right-hand side; in the extreme case of independent x, y (i.e.,
$\rho = 0$) we can use $\mathbf{E}[f(x)\, g(y)] = \mathbf{E}[f]\, \mathbf{E}[g] \leq \|f\|_1 \|g\|_1$. The Two-Function
Hypercontractivity Theorem gives a precise interpolation between these two
cases; the smaller the correlation $\rho$ is, the smaller the norms we may take on
the right-hand side.
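To make the interpolation concrete, here is a small brute-force experiment that evaluates both sides of the Two-Function Hypercontractivity Theorem on random real-valued functions over $\{-1,1\}^3$. It is only a numerical sanity check under assumptions chosen here (the function names, n = 3, and the sampling ranges for r, s and ρ are illustrative), not part of the proof.

```python
import itertools
import random

def norm(values, p):
    """L^p norm of a function on {-1,1}^n under the uniform distribution."""
    return (sum(abs(v) ** p for v in values) / len(values)) ** (1.0 / p)

def correlated_expectation(f, g, n, rho):
    """E[f(x) g(y)] for a rho-correlated pair (x, y) of strings in {-1,1}^n."""
    points = list(itertools.product([-1, 1], repeat=n))
    total = 0.0
    for i, x in enumerate(points):
        for j, y in enumerate(points):
            # Each coordinate of y equals that of x with probability (1 + rho)/2.
            weight = 1.0
            for xi, yi in zip(x, y):
                weight *= (1 + rho) / 2 if xi == yi else (1 - rho) / 2
            total += f[i] * g[j] * weight
    return total / len(points)

def largest_excess(n=3, trials=200, seed=0):
    """Largest value of LHS - RHS over random f, g and admissible r, s, rho."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        r = rng.uniform(0.05, 3.0)
        s = rng.uniform(0.05, min(3.0, 1.0 / r))   # guarantees sqrt(r*s) <= 1
        rho = rng.uniform(0.0, (r * s) ** 0.5)
        f = [rng.uniform(-1, 1) for _ in range(2 ** n)]
        g = [rng.uniform(-1, 1) for _ in range(2 ** n)]
        lhs = correlated_expectation(f, g, n, rho)
        rhs = norm(f, 1 + r) * norm(g, 1 + s)
        worst = max(worst, lhs - rhs)
    return worst
```

Consistent with the theorem, `largest_excess()` should not be positive beyond floating-point error; note that the sampled r and s are allowed to exceed 1, which is the regime not covered by the weak form.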
In the case that f and g have range {0, 1}, these ideas yield another
interpretation of the Two-Function Hypercontractivity Theorem, namely a two-set
generalization of the Small-Set Expansion Theorem:


Generalized Small-Set Expansion Theorem. Let $0 \leq \rho \leq 1$. Let $A, B \subseteq \{-1,1\}^n$
have volumes $\exp(-\frac{a^2}{2})$, $\exp(-\frac{b^2}{2})$ and assume $0 \leq \rho a \leq b \leq a$.
Then

$$\mathop{\mathbf{Pr}}_{\substack{(x, y) \\ \rho\text{-correlated}}}[x \in A,\ y \in B] \leq \exp\Bigl(-\frac{1}{2} \cdot \frac{a^2 - 2\rho a b + b^2}{1 - \rho^2}\Bigr).$$


Proof. Apply the Two-Function Hypercontractivity Theorem with $f = 1_A$,
$g = 1_B$ and minimize the right-hand side by selecting $r = \rho\, \frac{b - \rho a}{a - \rho b}$ and $s = \rho\, \frac{a - \rho b}{b - \rho a}$.


Remark 10.1. When a and b are not too close, one of the optimal choices of r and s
in the proof exceeds 1. Thus the Generalized Small-Set Expansion Theorem really
needs the full (non-weak) Two-Function Hypercontractivity Theorem; equivalently,
the full Hypercontractivity Theorem.


Remark 10.2. This theorem is essentially sharp in the case that A and B are
concentric Hamming balls; see Exercise 10.5. In the case b = a we recover
the Small-Set Expansion Theorem. In the case $b = \rho a$ we get only the trivial
bound that $\mathbf{Pr}[x \in A, y \in B] \leq \exp(-\frac{a^2}{2}) = \mathbf{Pr}[x \in A]$. However, not much
better than this can be expected; in the concentric Hamming ball case it indeed
holds that $\mathbf{Pr}[x \in A, y \in B] \sim \mathbf{Pr}[x \in A]$ whenever $b < \rho a$.


Remark 10.3. There is also a reverse form of the Hypercontractivity Theorem
and its Two-Function version; see Exercises 10.6–10.9. It directly implies the
following:


Reverse Small-Set Expansion Theorem. Let $0 \leq \rho \leq 1$. Let $A, B \subseteq \{-1,1\}^n$
have volumes $\exp(-\frac{a^2}{2})$, $\exp(-\frac{b^2}{2})$, where $a, b \geq 0$. Then

$$\mathop{\mathbf{Pr}}_{\substack{(x, y) \\ \rho\text{-correlated}}}[x \in A,\ y \in B] \geq \exp\Bigl(-\frac{1}{2} \cdot \frac{a^2 + 2\rho a b + b^2}{1 - \rho^2}\Bigr).$$


We now turn to the proofs. We begin by showing that the Hypercontractivity
Theorem and the Two-Function version are indeed equivalent. This is a
consequence of the following general fact (take $\mathrm{T} = \mathrm{T}_\rho$, $p = 1 + r$, $q = 1 + 1/s$):


Proposition 10.4. Let T be an operator on $L^2(\Omega, \pi)$ and let $1 \leq p, q \leq \infty$.
Then

$$\|\mathrm{T}f\|_q \leq \|f\|_p \qquad (10.2)$$

holds for all $f \in L^2(\Omega, \pi)$ if and only if

$$\langle \mathrm{T}f, g \rangle \leq \|f\|_p\, \|g\|_{q'} \qquad (10.3)$$

holds for all $f, g \in L^2(\Omega, \pi)$.


Proof. For the “only if” statement, $\langle \mathrm{T}f, g \rangle \leq \|\mathrm{T}f\|_q \|g\|_{q'} \leq \|f\|_p \|g\|_{q'}$ by
Hölder’s inequality and (10.2). As for the “if” statement, by Hölder’s inequality
and (10.3) we have

$$\|\mathrm{T}f\|_q = \sup_{\|g\|_{q'} = 1} \langle \mathrm{T}f, g \rangle \leq \sup_{\|g\|_{q'} = 1} \|f\|_p \|g\|_{q'} = \|f\|_p.$$

Now suppose we prove the Hypercontractivity Theorem in the case n = 1. 
By the above proposition we deduce the Two-Function version in the case 
n = 1. Then the Two-Function Hypercontractivity Induction Theorem from 
Chapter 9.4 yields the general-n case of the Two-Function Hypercontractivity 
Theorem. Finally, applying the above proposition again we get the general-n 
case of the Hypercontractivity Theorem, thereby completing all needed proofs. 
These observations all hold in the context of more general product spaces, so 
let’s record the following for future use: 


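Hypercontractivity Induction Theorem. Let $0 \leq \rho \leq 1$, $1 \leq p, q \leq \infty$, and
assume that $\|\mathrm{T}_\rho f\|_q \leq \|f\|_p$ holds for every $f \in L^2(\Omega_1, \pi_1), \dots, L^2(\Omega_n, \pi_n)$.
Then it also holds for every $f \in L^2(\Omega_1 \times \cdots \times \Omega_n, \pi_1 \otimes \cdots \otimes \pi_n)$.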


Remark 10.5. In traditional proofs of the Hypercontractivity Theorem for ±1
bits, this theorem is proven directly; it’s a slightly tricky induction by derivatives 
(see Exercise 10.3). For more general product spaces the same direct induction 
strategy also works but the notation becomes quite complicated. 


Our remaining task, therefore, is to prove the Hypercontractivity Theorem
in the case n = 1; in other words, to show that a uniformly random ±1 bit is
$(p, q, \sqrt{(p-1)/(q-1)})$-hypercontractive. This fact is often called the “Two-Point
Inequality” because (for fixed $p$, $q$, and $\rho$) it’s just an “elementary”
inequality about two real variables.


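Two-Point Inequality. Let $1 \le p \le q \le \infty$ and let $0 \le \rho \le \sqrt{(p-1)/(q-1)}$.
Then $\|T_\rho f\|_q \le \|f\|_p$ for any $f : \{-1, 1\} \to \mathbb{R}$.
Equivalently (for $\rho \ne 1$), a uniformly random bit $x \sim \{-1, 1\}$ is $(p, q, \rho)$-hypercontractive;
i.e., $\|a + \rho b x\|_q \le \|a + b x\|_p$ for all $a, b \in \mathbb{R}$.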
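The inequality is concrete enough to test numerically. The following is a minimal sketch (the function names are hypothetical, chosen only for this illustration) that evaluates both sides for a uniformly random bit and checks the inequality on a small grid of $a$, $b$ at the critical value $\rho = \sqrt{(p-1)/(q-1)}$. A finite grid check is of course only a sanity test, not a proof.

```python
import math

def two_point_norm(a, b, q):
    """q-norm of a + b*x when x is a uniformly random bit in {-1, +1}."""
    return (0.5 * abs(a + b) ** q + 0.5 * abs(a - b) ** q) ** (1.0 / q)

def critical_rho(p, q):
    """The largest rho allowed by the Two-Point Inequality."""
    return math.sqrt((p - 1.0) / (q - 1.0))

def two_point_holds(a, b, p, q, rho, tol=1e-12):
    """Test ||a + rho*b*x||_q <= ||a + b*x||_p, up to floating-point slack."""
    return two_point_norm(a, rho * b, q) <= two_point_norm(a, b, p) + tol

if __name__ == "__main__":
    grid = [-3.0, -1.0, -0.2, 0.0, 0.2, 1.0, 3.0]
    for p, q in [(1.5, 3.0), (2.0, 4.0), (1.2, 1.8), (3.0, 6.0)]:
        rho = critical_rho(p, q)
        ok = all(two_point_holds(a, b, p, q, rho) for a in grid for b in grid)
        print(f"p={p}, q={q}, rho={rho:.4f}, holds on grid: {ok}")
```

Taking $\rho$ noticeably above the critical value (for instance $\rho = 0.7$ with $p = 2$, $q = 4$, $a = 1$, $b = 0.1$) makes the check fail, which matches the optimality of the bound on $\rho$.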


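Proof. As in Section 9.3, our main task will be to prove the inequality for
$1 < p < q < 2$. Having done this, the $2 < p < q < \infty$ cases follow from
Proposition 9.19, the $p < 2 < q$ cases follow using the semigroup property
of $T_\rho$ (Exercise 9.17), and the $p = q$ cases follow from Exercise 2.33 (or
continuity). The proof for $1 < p < q < 2$ will be very similar to that of Theorem 9.18
(the $q = 2$ case). As in that proof we may reduce to the case that
$\rho = \sqrt{(p-1)/(q-1)}$, $a = 1$, and $b = \epsilon$ satisfies $|\epsilon| < 1$. It then suffices to
show

\begin{align*}
& \|1 + \rho \epsilon x\|_q^p \le \|1 + \epsilon x\|_p^p \\
\iff\ & \bigl(\tfrac12 (1 + \rho\epsilon)^q + \tfrac12 (1 - \rho\epsilon)^q\bigr)^{p/q} \le \tfrac12 (1 + \epsilon)^p + \tfrac12 (1 - \epsilon)^p \\
\iff\ & \Bigl(1 + \sum_{k=1}^{\infty} \binom{q}{2k} \rho^{2k} \epsilon^{2k}\Bigr)^{p/q} \le 1 + \sum_{k=1}^{\infty} \binom{p}{2k} \epsilon^{2k}. \tag{10.4}
\end{align*}

Again we used $|\epsilon| < 1$ to drop the absolute value signs and justify the Generalized
Binomial Theorem. For each of the binomial coefficients on the left
in (10.4) we have

\begin{align*}
\binom{q}{2k} &= \frac{q(q-1)(q-2)(q-3)\cdots(q-(2k-2))(q-(2k-1))}{(2k)!} \\
&= \frac{q(q-1)(2-q)(3-q)\cdots((2k-2)-q)((2k-1)-q)}{(2k)!} \ \ge\ 0.
\end{align*}

(Here we reversed an even number of signs, since $1 < q < 2$. We will later
do the same when expanding $\binom{p}{2k}$.) Thus we can again employ the inequality
$(1+t)^\theta \le 1 + \theta t$ for $t \ge 0$ and $0 \le \theta \le 1$ to deduce that the left-hand side
of (10.4) is at most

\[ 1 + \sum_{k=1}^{\infty} \frac{p}{q} \binom{q}{2k} \rho^{2k} \epsilon^{2k} = 1 + \sum_{k=1}^{\infty} \frac{p}{q} \Bigl(\frac{p-1}{q-1}\Bigr)^{k} \binom{q}{2k} \epsilon^{2k}. \]

We can now complete the proof of (10.4) by showing the following term-by-term
inequality: for all $k \ge 1$,

\begin{align*}
& \frac{p}{q} \Bigl(\frac{p-1}{q-1}\Bigr)^{k} \binom{q}{2k} \le \binom{p}{2k} \\
\iff\ & \frac{p}{q} \Bigl(\frac{p-1}{q-1}\Bigr)^{k} \frac{q(q-1)(2-q)\cdots((2k-1)-q)}{(2k)!} \le \frac{p(p-1)(2-p)\cdots((2k-1)-p)}{(2k)!} \\
\iff\ & \frac{2-q}{\sqrt{q-1}} \cdot \frac{3-q}{\sqrt{q-1}} \cdots \frac{(2k-1)-q}{\sqrt{q-1}} \le \frac{2-p}{\sqrt{p-1}} \cdot \frac{3-p}{\sqrt{p-1}} \cdots \frac{(2k-1)-p}{\sqrt{p-1}}.
\end{align*}

And indeed this inequality holds factor-by-factor. This is because $p < q$ and
$\frac{j-r}{\sqrt{r-1}}$ is a decreasing function of $r \ge 1$ for all $j \ge 2$, as is evident from

\[ \frac{d}{dr}\, \frac{j-r}{\sqrt{r-1}} = -\frac{j-2+r}{2(r-1)^{3/2}}. \]


Remark 10.6. The upper bound $\rho \le \sqrt{(p-1)/(q-1)}$ in this theorem is best
possible; see Exercise 9.10(b).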


10.2. Hypercontractivity of General Random Variables 


Let’s now study hypercontractivity for general random variables. By the end 
of this section we will have proved the General Hypercontractivity Theorem 
stated at the beginning of the chapter. 

Recall Definition 9.13, which says that $X$ is $(p, q, \rho)$-hypercontractive if
$\mathbf{E}[|X|^q] < \infty$ and

\[ \|a + \rho b X\|_q \le \|a + b X\|_p \quad \text{for all constants } a, b \in \mathbb{R}. \]


(By homogeneity, it’s sufficient to check this either with a fixed to 1 or with b 
fixed to 1.) Let’s also collect some additional basic facts regarding the concept: 


Fact 10.7. Suppose $X$ is $(p, q, \rho)$-hypercontractive ($1 \le p \le q \le \infty$, $0 \le \rho < 1$). Then:

(1) $\mathbf{E}[X] = 0$ (Exercise 9.10).

(2) $cX$ is $(p, q, \rho)$-hypercontractive for any $c \in \mathbb{R}$ (Exercise 9.9).

(3) $X$ is $(p, q, \rho')$-hypercontractive for any $0 \le \rho' \le \rho$ (Exercise 9.11).

(4) $\rho \le \sqrt{\frac{p-1}{q-1}}$ and $\rho \le \frac{\|X\|_p}{\|X\|_q}$ (Exercises 9.10, 9.9).


Proposition 10.8. Let $X$ be $(2, q, \rho)$-hypercontractive. Then $X$ is also
$(q', 2, \rho)$-hypercontractive, where $q'$ is the conjugate Hölder index of $q$.


Proof. The deduction is essentially the same as (9.6) from Chapter 9.2. Since
$\mathbf{E}[X] = 0$ (Fact 10.7(1)) we have

\[ \|a + \rho b X\|_2^2 = \mathbf{E}[a^2 + 2\rho a b X + \rho^2 b^2 X^2] = \mathbf{E}[(a + bX)(a + \rho^2 b X)]. \]

By Hölder’s inequality and then the $(2, q, \rho)$-hypercontractivity of $X$ this is at
most

\[ \|a + bX\|_{q'} \|a + \rho^2 b X\|_q \le \|a + bX\|_{q'} \|a + \rho b X\|_2. \]

Dividing through by $\|a + \rho b X\|_2$ (which can’t be 0 unless $X \equiv 0$) gives
$\|a + \rho b X\|_2 \le \|a + bX\|_{q'}$ as needed.


Remark 10.9. The converse does not hold; see Exercise 10.4.


Remark 10.10. As mentioned in Proposition 9.15, the sum of independent
hypercontractive random variables is equally hypercontractive. Furthermore, 
low-degree polynomials of independent hypercontractive random variables are 
“reasonable”. See Exercises 10.2 and 10.3. 


Given $X$, $p$, and $q$, computing the largest $\rho$ for which $X$ is $(p, q, \rho)$-hypercontractive
can often be quite a chore. However, if you’re not overly
concerned about constant factors then things become much easier. Let’s focus
on the most useful case, $p = 2$ and $q > 2$. By Fact 10.7(2) we may assume
$\|X\|_2 = 1$. Then we can ask:


Question 10.11. Let $\mathbf{E}[X] = 0$, $\|X\|_2 = 1$, and assume $\|X\|_q < \infty$. For what
$\rho$ is $X$ $(2, q, \rho)$-hypercontractive?


In this section we’ll answer the question by showing that $\rho = \Theta_q(1/\|X\|_q)$
is sufficient. By the second part of Fact 10.7(4), $\rho \le 1/\|X\|_q$ is also necessary.
So for a mean-zero random variable $X$, the largest $\rho$ for which $X$ is
$(2, q, \rho)$-hypercontractive is always within a constant (depending only on $q$) of
$\|X\|_2 / \|X\|_q$.

Let’s arrive at this result in steps, introducing the useful techniques of
symmetrization and randomization along the way. When studying hypercontractivity
of a random variable $X$, things are much more convenient if $X$ is
a symmetric random variable, meaning $-X$ has the same distribution as $X$.
One advantage of symmetric random variables $X$ is that they have $\mathbf{E}[X^k] = 0$
for all odd $k \in \mathbb{N}$. Using this it is easy to prove (Exercise 10.11) the following
fact, similar to Corollary 9.6. (The proof is similar to that of Proposition 9.16.)


Proposition 10.12. Let $X$ be a symmetric random variable with $\|X\|_2 = 1$.
Assume $\|X\|_4 = C$ (hence $X$ is “$C^4$-reasonable”). Then $X$ is $(2, 4, \rho)$-hypercontractive
if and only if $\rho \le \min\bigl(\frac{1}{\sqrt{3}}, \frac{1}{C}\bigr)$.


Given a symmetric random variable $X$, the randomization trick is to
replace $X$ by the identically distributed random variable $rX$, where $r \sim \{-1, 1\}$
is an independent uniformly random bit. This trick sometimes lets you reduce
a probabilistic statement about $X$ to a related one about $r$.


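Theorem 10.13. Let $X$ be a symmetric random variable with $\|X\|_2 = 1$ and let
$\|X\|_q = C$, where $q > 2$. Then $X$ is $(2, q, \rho)$-hypercontractive for $\rho = \frac{1}{C\sqrt{q-1}}$.


Proof. Let $r \sim \{-1, 1\}$ be uniformly random and let $\widetilde{X}$ denote $X/C$. Then for
any $a \in \mathbb{R}$,

\begin{align*}
\|a + \rho X\|_q^2 &= \|a + \rho r X\|_q^2 && \text{(by symmetry of $X$)} \\
&= \mathbf{E}_X\bigl[\mathbf{E}_r[|a + \rho r X|^q]\bigr]^{2/q} \\
&\le \mathbf{E}_X\bigl[\mathbf{E}_r[|a + r \widetilde{X}|^2]^{q/2}\bigr]^{2/q} && \text{($r$ is $(2, q, \tfrac{1}{\sqrt{q-1}})$-hypercontractive)} \\
&= \mathbf{E}_X\bigl[(a^2 + \widetilde{X}^2)^{q/2}\bigr]^{2/q} && \text{(Parseval)} \\
&= \|a^2 + \widetilde{X}^2\|_{q/2} && \text{(norm with respect to $X$)} \\
&\le a^2 + \|\widetilde{X}^2\|_{q/2} && \text{(triangle inequality for $\|\cdot\|_{q/2}$)} \\
&= a^2 + \|\widetilde{X}\|_q^2 \\
&= a^2 + 1 = a^2 + \mathbf{E}[X^2] = \|a + X\|_2^2,
\end{align*}

where the last step also used $\mathbf{E}[X] = 0$.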


Next, if X is not symmetric then we can use a symmetrization trick to make 
it so. One way to do this is to replace X with the symmetric random variable 
X — X', where X’ is an independent copy of X. In general X — X’ has similar 
properties to X. In particular, if E[X] = 0 we can compare norms using the 
following one-sided bound: 


Lemma 10.14. Let $X$ be a random variable satisfying $\mathbf{E}[X] = 0$ and $\|X\|_q < \infty$,
where $q \ge 1$. Then for any $a \in \mathbb{R}$,

\[ \|a + X\|_q \le \|a + X - X'\|_q, \]

where $X'$ denotes an independent copy of $X$.

Proof. We have

\[ \|a + X\|_q^q = \mathbf{E}[|a + X|^q] = \mathbf{E}[|a + X - \mathbf{E}[X']|^q], \]

where we used the fact that $\mathbf{E}[X'] = 0$. But now

\[ \mathbf{E}[|a + X - \mathbf{E}[X']|^q] = \mathbf{E}_X\bigl[\,\bigl|\mathbf{E}_{X'}[a + X - X']\bigr|^q\,\bigr] \le \mathbf{E}[|a + X - X'|^q] = \|a + X - X'\|_q^q, \]

where we used convexity of $t \mapsto |t|^q$.


A combination of the randomization and symmetrization tricks is to replace
an arbitrary random variable $X$ by $rX$, where $r \sim \{-1, 1\}$ is an independent
uniformly random bit. This often lets you extend results about symmetric random
variables to the case of general mean-zero random variables. For example,
the following hypercontractivity lemma lets us reduce to the case of a symmetric
random variable while only “spending” a factor of $\frac12$:


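Lemma 10.15. Let $X$ be a random variable satisfying $\mathbf{E}[X] = 0$ and $\|X\|_q < \infty$,
where $q \ge 1$. Then for any $a \in \mathbb{R}$,

\[ \|a + \tfrac12 X\|_q \le \|a + rX\|_q, \]

where $r \sim \{-1, 1\}$ is an independent uniformly random bit.

Proof. Letting $X'$ be an independent copy of $X$ we have

\begin{align*}
\|a + \tfrac12 X\|_q &\le \|a + \tfrac12 X - \tfrac12 X'\|_q && \text{(Lemma 10.14 applied to $\tfrac12 X$)} \\
&= \|a + r(\tfrac12 X - \tfrac12 X')\|_q && \text{(since $\tfrac12 X - \tfrac12 X'$ is symmetric)} \\
&= \|\tfrac12 a + \tfrac12 rX + \tfrac12 a - \tfrac12 rX'\|_q \\
&\le \|\tfrac12 a + \tfrac12 rX\|_q + \|\tfrac12 a - \tfrac12 rX'\|_q && \text{(triangle inequality for $\|\cdot\|_q$)} \\
&= \|\tfrac12 a + \tfrac12 rX\|_q + \|\tfrac12 a + \tfrac12 rX'\|_q && \text{($-r$ distributed as $r$)} \\
&= \|a + rX\|_q.
\end{align*}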
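For a discrete mean-zero random variable both sides of Lemma 10.15 are finite sums, so the lemma can be sanity-checked directly. The sketch below is only an illustration under that assumption; the helper names `lq_norm` and `lemma_10_15_sides` are hypothetical, and $X$ is represented by parallel lists of values and probabilities.

```python
def lq_norm(values, probs, q):
    """(E|Y|^q)^(1/q) for a discrete Y given by its values and probabilities."""
    return sum(p * abs(v) ** q for v, p in zip(values, probs)) ** (1.0 / q)

def lemma_10_15_sides(values, probs, a, q):
    """Return (||a + X/2||_q, ||a + r X||_q) for a discrete mean-zero X."""
    mean = sum(p * v for v, p in zip(values, probs))
    if abs(mean) > 1e-9:
        raise ValueError("the lemma assumes E[X] = 0")
    lhs = lq_norm([a + 0.5 * v for v in values], probs, q)
    # r X equals +v or -v, each carrying half of the mass Pr[X = v].
    rx_values = [a + v for v in values] + [a - v for v in values]
    rx_probs = [0.5 * p for p in probs] * 2
    rhs = lq_norm(rx_values, rx_probs, q)
    return lhs, rhs

if __name__ == "__main__":
    values, probs = [3.0, -1.0], [0.25, 0.75]  # a biased, mean-zero variable
    for q in (1.0, 1.5, 2.0, 4.0):
        for a in (-2.0, 0.0, 0.5, 3.0):
            lhs, rhs = lemma_10_15_sides(values, probs, a, q)
            print(f"q={q}, a={a}: {lhs:.4f} <= {rhs:.4f}")
```

For instance, with $q = 2$ and $a = 1$ the example above has $\mathbf{E}[X^2] = 3$, so the two sides are $\sqrt{1.75}$ and $2$.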


By employing these randomization/symmetrization techniques we obtain a
$(2, q)$-hypercontractivity statement for all mean-zero random variables $X$ with
$\|X\|_q / \|X\|_2$ bounded, giving a good answer to Question 10.11:


Theorem 10.16. Let $X$ satisfy $\mathbf{E}[X] = 0$, $\|X\|_2 = 1$, $\|X\|_q = C$, where $q > 2$.
Then $X$ is $(2, q, \frac12 \rho)$-hypercontractive for $\rho = \frac{1}{\sqrt{q-1}\,\|X\|_q}$. (If $X$ is symmetric,
then the factor of $\frac12$ may be omitted.)


Proof. By Lemma 10.15 we have

\[ \|a + \tfrac12 \rho X\|_q^2 \le \|a + \rho r X\|_q^2. \]

Since $rX$ is a symmetric random variable satisfying $\|rX\|_2 = 1$, $\|rX\|_q = C$,
Theorem 10.13 implies

\[ \|a + \rho r X\|_q^2 \le \|a + rX\|_2^2 = a^2 + 1 = \|a + X\|_2^2. \]


This completes the proof.


If $X$ is a discrete random variable then instead of computing $\|X\|_q / \|X\|_2$ it can
sometimes be convenient to use a bound based on the minimum value of
$X$’s probability mass function. The following is a simple generalization of
Proposition 9.5, whose proof is left for Exercise 10.17:


Proposition 10.17. Let $X$ be a discrete random variable with probability mass
function $\pi$. Write

\[ \lambda = \min(\pi) = \min_{x \in \mathrm{range}(X)} \{\Pr[X = x]\}. \]

Then for any $q > 2$ we have $\|X\|_q \le (1/\lambda)^{1/2 - 1/q} \cdot \|X\|_2$.

As a consequence of Theorem 10.16, if in addition $\mathbf{E}[X] = 0$ then $X$
is $(2, q, \frac12 \rho)$-hypercontractive for $\rho = \frac{1}{\sqrt{q-1}} \cdot \lambda^{1/2 - 1/q}$, and also $(q', 2, \frac12 \rho)$-hypercontractive
by Proposition 10.8. (If $X$ is symmetric then the factor of $\frac12$
may be omitted.)
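As a small illustration of Proposition 10.17, the sketch below compares the actual norm ratio of a discrete random variable with the bound $(1/\lambda)^{1/2 - 1/q}$, and computes the hypercontractivity parameter the proposition provides. The function names are hypothetical, and all listed probabilities are assumed to be strictly positive so that $\lambda$ really is the least value of the mass function.

```python
def lq_norm(values, probs, q):
    """(E|X|^q)^(1/q) for a discrete X given by its values and probabilities."""
    return sum(p * abs(v) ** q for v, p in zip(values, probs)) ** (1.0 / q)

def norm_ratio_and_bound(values, probs, q):
    """Return (||X||_q / ||X||_2, (1/lambda)^(1/2 - 1/q)) with lambda = min prob."""
    lam = min(probs)
    ratio = lq_norm(values, probs, q) / lq_norm(values, probs, 2)
    return ratio, (1.0 / lam) ** (0.5 - 1.0 / q)

def simple_rho(probs, q, symmetric=False):
    """The parameter from Proposition 10.17: lambda^(1/2 - 1/q) / sqrt(q - 1),
    halved unless X is known to be symmetric."""
    rho = min(probs) ** (0.5 - 1.0 / q) / (q - 1.0) ** 0.5
    return rho if symmetric else 0.5 * rho

if __name__ == "__main__":
    lam = 0.01
    # A mean-zero, variance-one lambda-biased bit.
    values = [((1 - lam) / lam) ** 0.5, -(lam / (1 - lam)) ** 0.5]
    probs = [lam, 1 - lam]
    for q in (2.5, 4.0, 10.0):
        ratio, bound = norm_ratio_and_bound(values, probs, q)
        print(f"q={q}: ratio {ratio:.4f} <= bound {bound:.4f}, rho = {simple_rho(probs, q):.4f}")
```

For the biased bit with $\lambda = 0.01$ and $q = 4$ the ratio is about 3.146 against a bound of about 3.162, so the proposition is nearly tight in that case.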


For each $q > 2$, the value $\rho = \Theta_q(\lambda^{1/2 - 1/q})$ in Proposition 10.17 has the
optimal dependence on $\lambda$, up to a constant. In fact, a perfectly sharp version
of Proposition 10.17 is known. The most important case is when $X$ is a
$\lambda$-biased bit; more precisely, when $X = \phi(x_i)$ for $x_i \sim \pi_\lambda$, in the notation of
Definition 8.39. In that case, the below theorem (whose very technical proof
is left to Exercises 10.19-10.21) is due to Latała and Oleszkiewicz (Latała
and Oleszkiewicz, 1994). The case of general discrete random variables is a
reduction to the two-valued case due to Wolff (Wolff, 2007).


Theorem 10.18. Let $X$ be a mean-zero discrete random variable and let $\lambda < 1/2$
be the least value of its probability mass function, as in Proposition 10.17.
Then for $q > 2$ it holds that $X$ is $(2, q, \rho)$-hypercontractive and $(q', 2, \rho)$-hypercontractive
for

\[ \rho = \sqrt{\frac{\exp(u/q) - \exp(-u/q)}{\exp(u/q') - \exp(-u/q')}} = \sqrt{\frac{\sinh(u/q)}{\sinh(u/q')}}, \quad \text{with $u$ defined by } \exp(-u) = \frac{\lambda}{1-\lambda}. \tag{10.5} \]

This value of $\rho$ is optimal, even under the assumption that $X$ is two-valued.


Remark 10.19. It’s not hard to see that for $\lambda \to 1/2$ (hence $u \to 0$) we get
$\rho \to \sqrt{\frac{1/q}{1/q'}} = \frac{1}{\sqrt{q-1}}$, consistent with the Two-Point Inequality from Section 10.1.
Also, for $\lambda \to 0$ (hence $u \to \infty$) we get $\rho \sim \sqrt{\frac{\exp(u/q)}{\exp(u/q')}} \sim \lambda^{1/2 - 1/q}$,
showing that Proposition 10.17 is sharp up to a constant. Exercise 10.18 asks
you to investigate the function defining $\rho$ in (10.5) more carefully. In particular,
you’ll show that $\rho \ge \frac{1}{\sqrt{q-1}} \cdot \lambda^{1/2 - 1/q}$ holds for all $\lambda$. Hence we can omit the
factor of $\frac12$ from the simpler bound in Proposition 10.17 even for nonsymmetric
random variables.
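The limiting behaviour described in Remark 10.19 is easy to observe numerically. The following sketch evaluates the formula (10.5) in its sinh form; the names `sharp_rho` and `simple_rho` are hypothetical and introduced only for this illustration.

```python
import math

def sharp_rho(lam, q):
    """rho from (10.5): sqrt(sinh(u/q) / sinh(u/q')), where exp(-u) = lam/(1-lam)."""
    if not (0.0 < lam < 0.5) or q <= 2:
        raise ValueError("need 0 < lam < 1/2 and q > 2")
    u = math.log((1.0 - lam) / lam)
    q_conj = q / (q - 1.0)  # conjugate Holder index q'
    return math.sqrt(math.sinh(u / q) / math.sinh(u / q_conj))

def simple_rho(lam, q):
    """The simpler value lam^(1/2 - 1/q) / sqrt(q - 1) from Proposition 10.17."""
    return lam ** (0.5 - 1.0 / q) / math.sqrt(q - 1.0)

if __name__ == "__main__":
    q = 4.0
    for lam in (0.4999, 0.25, 0.1, 0.01, 1e-9):
        print(f"lam={lam}: sharp {sharp_rho(lam, q):.6f}, simple {simple_rho(lam, q):.6f}")
    print("two-point value 1/sqrt(q-1):", 1.0 / math.sqrt(q - 1.0))
```

With $q = 4$, the sharp value approaches $1/\sqrt{3}$ as $\lambda \to 1/2$ and behaves like $\lambda^{1/4}$ as $\lambda \to 0$, while staying above the simpler value at every $\lambda$ tried here.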


Corollary 10.20. Let $(\Omega, \pi)$ be a finite probability space, $|\Omega| \ge 2$, in which
every outcome has probability at least $\lambda$. Let $f \in L^2(\Omega, \pi)$. Then for any $q > 2$
and $0 \le \rho \le \frac{1}{\sqrt{q-1}} \cdot \lambda^{1/2 - 1/q}$,

\[ \|T_\rho f\|_q \le \|f\|_2 \quad \text{and} \quad \|T_\rho f\|_2 \le \|f\|_{q'}. \]


Proof. Recalling Chapter 8.3, this follows from the decomposition $f(x) = f^{=\emptyset} + f^{=\{1\}}(x)$,
under which $T_\rho f = f^{=\emptyset} + \rho f^{=\{1\}}$. Note that for $x \sim \pi$ the random
variable $f^{=\{1\}}(x)$ has mean zero, and the least value of its probability mass
function is at least $\lambda$.


The General Hypercontractivity Theorem stated at the beginning of the chap- 
ter now follows by applying the Hypercontractivity Induction Theorem from 
Section 10.1. 


10.3. Applications of General Hypercontractivity 


In this section we will collect some applications of the General Hypercontractiv- 
ity Theorem, including generalizations of the facts from Section 9.5. We begin 
by bounding the q-norms of low-degree functions. The proof is essentially the
same as that of Theorem 9.21; see Exercise 10.28. 


Theorem 10.21. In the setting of the General Hypercontractivity Theorem, if $f$
has degree at most $k$, then

\[ \|f\|_q \le \bigl(\sqrt{q-1} \cdot \lambda^{1/q - 1/2}\bigr)^k \|f\|_2. \]


Next we turn to an analogue of Theorem 9.22, getting a relationship between
the 2-norm and the 1-norm for low-degree functions. The proof (Exercise 10.31)
needs $(2, q, \rho)$-hypercontractivity with $q$ tending to 2, so to get the most elegant
statement requires appealing to the sharp bound from Theorem 10.18:


Theorem 10.22. In the setting of the General Hypercontractivity Theorem, if
$f$ has degree at most $k$, then

\[ \|f\|_2 \le c(\lambda)^k \|f\|_1, \quad \text{where } c(\lambda) = \sqrt{\frac{1-\lambda}{\lambda}}^{\,1/(1-2\lambda)}. \]

We have $c(\lambda) \sim 1/\sqrt{\lambda}$ as $\lambda \to 0$, $c(\lambda) \to e$ as $\lambda \to \frac12$, and in general, $c(\lambda) \le e/\sqrt{2\lambda}$.
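The three stated properties of $c(\lambda)$ can be observed numerically. The sketch below is only an illustration (the function name `c_const` is hypothetical); it evaluates $c(\lambda) = \bigl((1-\lambda)/\lambda\bigr)^{1/(2(1-2\lambda))}$ for $0 < \lambda < 1/2$ and compares it against $e/\sqrt{2\lambda}$.

```python
import math

def c_const(lam):
    """c(lambda) = sqrt((1 - lambda)/lambda) ** (1 / (1 - 2*lambda)), 0 < lambda < 1/2."""
    if not (0.0 < lam < 0.5):
        raise ValueError("need 0 < lam < 1/2")
    return math.sqrt((1.0 - lam) / lam) ** (1.0 / (1.0 - 2.0 * lam))

if __name__ == "__main__":
    for lam in (1e-8, 0.001, 0.1, 0.25, 0.4, 0.49, 0.499999):
        upper = math.e / math.sqrt(2.0 * lam)
        print(f"lam={lam}: c = {c_const(lam):.6f}, e/sqrt(2 lam) = {upper:.6f}, "
              f"c*sqrt(lam) = {c_const(lam) * math.sqrt(lam):.6f}")
```

At $\lambda = 0.25$, for example, $c(\lambda) = 3$ exactly, compared with the general upper bound $e/\sqrt{0.5} \approx 3.84$.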
Just as in Chapter 9.5 we obtain (Exercise 10.32) the following as a corollary: 


Theorem 10.23. In the setting of the General Hypercontractivity Theorem, if
$f$ is a nonconstant function of degree at most $k$, then

\[ \Pr_{x \sim \pi^{\otimes n}}\bigl[f(x) > \mathbf{E}[f]\bigr] \ge \tfrac14 (e^2/2\lambda)^{-k} \ge (15/\lambda)^{-k}. \]


Extending Theorem 9.23, the concentration bound for degree-$k$ functions,
is straightforward (see Exercise 10.33). We again get that the probability of
exceeding $t$ standard deviations decays like $\exp(-\Theta(t^{2/k}))$, though the constant
in the $\Theta(\cdot)$ is linear in $\lambda$:


Theorem 10.24. In the setting of the General Hypercontractivity Theorem, if
$f$ has degree at most $k$, then for any $t \ge \sqrt{2e/\lambda}^{\,k}$,

\[ \Pr_{x \sim \pi^{\otimes n}}\bigl[|f(x)| \ge t \|f\|_2\bigr] \le \lambda^k \exp\bigl(-\tfrac{k}{2e} \lambda t^{2/k}\bigr). \]


Next, we give a generalization of the Small-Set Expansion Theorem, the 
proof being left for Exercise 10.34. 


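Theorem 10.25. Let $(\Omega, \pi)$ be a finite probability space, $|\Omega| \ge 2$, in which
every outcome has probability at least $\lambda$. Let $A \subseteq \Omega^n$ have “volume” $\alpha$; i.e.,
suppose $\Pr_{x \sim \pi^{\otimes n}}[x \in A] = \alpha$. Let $q \ge 2$. Then for any

\[ 0 \le \rho \le \frac{1}{q-1} \cdot \lambda^{1 - 2/q} \]

(or even $\rho$ as large as the square of the quantity in Theorem 10.18) we have

\[ \mathbf{Stab}_\rho[1_A] = \Pr_{x \sim \pi^{\otimes n},\ y \sim N_\rho(x)}[x \in A,\ y \in A] \le \alpha^{2 - 2/q}. \]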


Similarly, we can generalize Corollary 9.25, bounding the stable influence of a
coordinate by a power of the usual influence:


Theorem 10.26. In the setting of Theorem 10.25, if $f : \Omega^n \to \{-1, 1\}$, then

\[ \rho\, \mathbf{Inf}_i^{(\rho)}[f] \le \mathbf{Inf}_i[f]^{2 - 2/q} \]

for all $i \in [n]$. In particular, by selecting $q = 4$ we get

\[ \sum_{S \ni i} \bigl(\tfrac{\sqrt{\lambda}}{3}\bigr)^{|S|} \|f^{=S}\|_2^2 \le \mathbf{Inf}_i[f]^{3/2}. \tag{10.6} \]


Proof. Applying the General Hypercontractivity Theorem to $L_i f$ and squaring
we get

\[ \|T_{\sqrt{\rho}} L_i f\|_2^2 \le \|L_i f\|_{q'}^2. \]

By definition, the left-hand side is $\rho\, \mathbf{Inf}_i^{(\rho)}[f]$. The right-hand side is
$(\|L_i f\|_{q'}^{q'})^{2 - 2/q}$, and $\|L_i f\|_{q'}^{q'} \le \mathbf{Inf}_i[f]$ by Exercise 8.10(b).


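The KKL Edge-Isoperimetric Theorem in this setting now follows by an
almost verbatim repetition of the proof from Chapter 9.6.


KKL Isoperimetric Theorem for general product space domains. In the setting
of the General Hypercontractivity Theorem, suppose $f$ has range $\{-1, 1\}$
and is nonconstant. Let $\mathbf{I}'[f] = \mathbf{I}[f] / \mathbf{Var}[f] \ge 1$. Then

\[ \mathbf{MaxInf}[f] \ge \frac{1}{\mathbf{I}'[f]^2} \cdot (9/\lambda)^{-\mathbf{I}'[f]}. \]

As a consequence, $\mathbf{MaxInf}[f] \ge \Omega\bigl(\frac{1}{\log(1/\lambda)}\bigr) \cdot \mathbf{Var}[f] \cdot \frac{\log n}{n}$.

Proof. (Cf. Exercise 9.29.) The proof is essentially identical to the one in
Chapter 9.6, but using (10.6) from Theorem 10.26. Summing this inequality
over all $i \in [n]$ yields

\[ \sum_{S \subseteq [n]} |S| \bigl(\tfrac{\sqrt{\lambda}}{3}\bigr)^{|S|} \|f^{=S}\|_2^2 \le \sum_{i=1}^{n} \mathbf{Inf}_i[f]^{3/2} \le \mathbf{MaxInf}[f]^{1/2} \cdot \mathbf{I}[f]. \tag{10.7} \]

On the left-hand side above we will drop the factor of $|S|$ for $|S| > 0$. We
also introduce the set-valued random variable $\boldsymbol{S}$ defined by $\Pr[\boldsymbol{S} = S] = \|f^{=S}\|_2^2 / \mathbf{Var}[f]$
for $S \ne \emptyset$. Note that $\mathbf{E}[|\boldsymbol{S}|] = \mathbf{I}'[f]$. Thus

\[ \mathrm{LHS}(10.7) \ge \mathbf{Var}[f] \cdot \mathbf{E}_{\boldsymbol{S}}\bigl[(\sqrt{\lambda}/3)^{|\boldsymbol{S}|}\bigr] \ge \mathbf{Var}[f] \cdot (\sqrt{\lambda}/3)^{\mathbf{E}[|\boldsymbol{S}|]} = \mathbf{Var}[f] \cdot (\sqrt{\lambda}/3)^{\mathbf{I}'[f]}, \]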


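where we used that $s \mapsto (\sqrt{\lambda}/3)^s$ is convex. The first statement of the theorem
now follows after rearrangement. As for the second statement, there is some
universal $c > 0$ such that

\[ \mathbf{I}'[f] \le c \cdot \tfrac{1}{\log(1/\lambda)} \cdot \log n \implies \frac{1}{\mathbf{I}'[f]^2} \cdot (9/\lambda)^{-\mathbf{I}'[f]} \ge \frac{1}{\sqrt{n}}, \]

say, in which case our lower bound for $\mathbf{MaxInf}[f]$ is $\frac{1}{\sqrt{n}} \gg \frac{\log n}{n}$. On the other
hand,

\[ \mathbf{I}'[f] \ge c \cdot \tfrac{1}{\log(1/\lambda)} \cdot \log n \implies \mathbf{I}[f] \ge \Omega\bigl(\tfrac{1}{\log(1/\lambda)}\bigr) \cdot \mathbf{Var}[f] \cdot \log n, \]

in which case even the average influence of $f$ is $\Omega\bigl(\frac{1}{\log(1/\lambda)}\bigr) \cdot \mathbf{Var}[f] \cdot \frac{\log n}{n}$.


Similarly, essentially no extra work is required to generalize Theorem 9.28
and Friedgut’s Junta Theorem to general product space domains; see Exercise 10.35.
For example, we have:


Friedgut’s Junta Theorem for general product space domains. In the setting
of the General Hypercontractivity Theorem, if $f$ has range $\{-1, 1\}$ and
$0 < \epsilon \le 1$, then $f$ is $\epsilon$-close to a $(1/\lambda)^{O(\mathbf{I}[f]/\epsilon)}$-junta $h : \Omega^n \to \{-1, 1\}$ (i.e.,
$\Pr_{x \sim \pi^{\otimes n}}[f(x) \ne h(x)] \le \epsilon$).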

We conclude this section by establishing “sharp thresholds” (in the sense
of Chapter 8.4) for monotone transitive-symmetric functions with critical
probability in the range $[1/n^{o(1)}, 1 - 1/n^{o(1)}]$. Let $f : \{-1, 1\}^n \to \{-1, 1\}$ be
a nonconstant monotone function and define the (strictly increasing) curve
$F : [0, 1] \to [0, 1]$ by $F(p) = \Pr_{x \sim \pi_p^{\otimes n}}[f(x) = -1]$. Recall that the critical
probability $p_c$ is defined to be the value such that $F(p_c) = 1/2$; equivalently,
such that $\mathbf{Var}[f^{(p_c)}] = 1$. Recall also the Margulis–Russo Formula, which says
that

\[ \frac{d}{dp} F(p) = \frac{1}{\sigma^2} \cdot \mathbf{I}[f^{(p)}], \]

where

\[ \sigma^2 = \sigma^2(p) = \mathbf{Var}[x_i] = 4p(1-p) = \Theta(\min(p, 1-p)). \]


Remark 10.27. Since we will not be concerned with constant factors, it’s
helpful in the following discussion to mentally replace $\sigma^2$ with $\min(p, 1-p)$.
In fact it’s even more helpful to always assume $p \le 1/2$ and replace $\sigma^2$ with $p$.


Now suppose $f$ is a transitive-symmetric function, e.g., a graph property.
This means that all of its influences are the same, i.e.,

\[ \mathbf{Inf}_i[f^{(p)}] = \mathbf{MaxInf}[f^{(p)}] = \tfrac{1}{n}\, \mathbf{I}[f^{(p)}] \]

for all $i \in [n]$. It thus follows from the KKL Theorem for general product
spaces that

\[ \mathbf{I}[f^{(p)}] \ge \Omega\Bigl(\frac{1}{\log(1/\min(p, 1-p))}\Bigr) \cdot \mathbf{Var}[f^{(p)}] \cdot \log n; \]

hence

\[ \frac{d}{dp} F(p) \ge \mathbf{Var}[f^{(p)}] \cdot \Omega\Bigl(\frac{1}{\sigma^2 \ln(e/\sigma^2)}\Bigr) \cdot \log n. \tag{10.8} \]

(As mentioned in Remark 10.27, assuming $p \le 1/2$ you can read $\sigma^2 \ln(e/\sigma^2)$
as $p \log(1/p)$.)

If we take $p = p_c$ in inequality (10.8) we conclude that $F(p)$ has a large
derivative at its critical probability: $F'(p_c) \ge \Omega\bigl(\frac{1}{p_c \log(1/p_c)}\bigr) \cdot \log n$, assuming
$p_c \le 1/2$. In particular if $\log(1/p_c) \ll \log n$, that is, $p_c > 1/n^{o(1)}$, then
$F'(p_c) = \omega(\frac{1}{p_c})$. This suggests that $f$ has a “sharp threshold”; i.e., $F(p)$ jumps
from near 0 to near 1 in an interval of the form $p_c(1 \pm o(1))$. However, largeness
of $F'(p_c)$ is not quite enough to establish a sharp threshold (see Exercise 8.30);
we need to have $F'(p)$ large throughout the range of $p$ near $p_c$ where $\mathbf{Var}[f^{(p)}]$
is large. Happily, inequality (10.8) provides precisely this.


Remark 10.28. Even if we are only concerned about monotone functions $f$
with $p_c = 1/2$, we still need the KKL Theorem for general product spaces to
establish a sharp threshold. Though $F'(1/2) \ge \Omega(\log n)$ can be derived using
just the uniform-distribution KKL Theorem from Chapter 9.6, we also need to
know that $F'(p) \ge \Omega(\log n)$ continues to hold for $p = 1/2 \pm O(1/\log n)$.


Making the above ideas precise, we can establish the following result of 
Friedgut and Kalai (Friedgut and Kalai, 1996) (cf. Exercises 8.28, 8.29): 


Theorem 10.29. Let $f : \{-1, 1\}^n \to \{-1, 1\}$ be a nonconstant, monotone,
transitive-symmetric function and let $F : [0, 1] \to [0, 1]$ be the strictly increasing
function defined by $F(p) = \Pr_{x \sim \pi_p^{\otimes n}}[f(x) = -1]$. Let $p_c$ be the critical
probability such that $F(p_c) = 1/2$ and assume without loss of generality that
$p_c \le 1/2$. Fix $0 < \epsilon < 1/4$ and let

\[ \eta = B \log(1/\epsilon) \cdot \frac{\log(1/p_c)}{\log n}, \]

where $B > 0$ is a certain universal constant. Then assuming $\eta \le 1/2$,

\[ F(p_c \cdot (1 - \eta)) \le \epsilon, \qquad F(p_c \cdot (1 + \eta)) \ge 1 - \epsilon. \]


Proof. Let $p$ be in the range $p_c \cdot (1 \pm \eta)$. By the assumption $\eta \le 1/2$ we also
have $\frac12 p_c \le p \le \frac32 p_c \le \frac34$. It follows that the quantity $\sigma^2 \ln(e/\sigma^2)$ in the KKL
corollary (10.8) is within a universal constant factor of $p_c \log(1/p_c)$. Thus for
all $p$ in the range $p_c \cdot (1 \pm \eta)$ we obtain

\[ F'(p) \ge \mathbf{Var}[f^{(p)}] \cdot \Omega\Bigl(\frac{1}{p_c \log(1/p_c)}\Bigr) \cdot \log n. \]

Using $\mathbf{Var}[f^{(p)}] = 4F(p)(1 - F(p))$, the definition of $\eta$, and a suitable choice
of $B$, this is equivalent to

\[ F'(p) \ge \frac{2\ln(1/2\epsilon)}{\eta p_c}\, F(p)(1 - F(p)). \tag{10.9} \]

We now show that (10.9) implies that $F(p_c - \eta p_c) \le \epsilon$ and leave the
implication $F(p_c + \eta p_c) \ge 1 - \epsilon$ to Exercise 10.36. For $p \le p_c$ we have
$1 - F(p) \ge 1/2$ and hence

\[ F'(p) \ge \frac{\ln(1/2\epsilon)}{\eta p_c}\, F(p) \implies \frac{d}{dp} \ln F(p) = \frac{F'(p)}{F(p)} \ge \frac{\ln(1/2\epsilon)}{\eta p_c}. \]

It follows that

\[ \ln F(p_c - \eta p_c) \le \ln F(p_c) - \ln(1/2\epsilon) = \ln(1/2) - \ln(1/2\epsilon) = \ln \epsilon; \]

i.e., $F(p_c - \eta p_c) \le \epsilon$ as claimed.


This proof establishes that every monotone transitive-symmetric function
with critical probability at least $1/n^{o(1)}$ (and at most $1 - 1/n^{o(1)}$) has a sharp
threshold. Unfortunately, the restriction on the critical probability can’t be
removed. The simplest example illustrating this is the logical OR function
$\mathrm{OR}_n : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$ (equivalently, the graph property of containing
an edge), which has critical probability $p_c \sim \frac{\ln 2}{n}$. Even though $\mathrm{OR}_n$
is transitive-symmetric, it has constant total influence at its critical probability,
$\mathbf{I}[\mathrm{OR}_n^{(p_c)}] \sim 2 \ln 2$. Indeed, $\mathrm{OR}_n$ doesn’t have a sharp threshold; i.e.,
it’s not true that $\Pr_{\pi_p^{\otimes n}}[\mathrm{OR}_n(x) = \text{True}] = 1 - o(1)$ for $p = p_c(1 + o(1))$. For
example, if $x$ is drawn from the $(2p_c)$-biased distribution we still just have
$\Pr[\mathrm{OR}_n(x) = \text{True}] \approx 3/4$. On the other hand, most “interesting” monotone
transitive-symmetric functions do have a sharp threshold; in Section 10.5 we’ll
derive a more sophisticated method for establishing this.
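The OR example can be worked out exactly, since $\Pr[\mathrm{OR}_n(x) = \text{True}] = 1 - (1-p)^n$ when each bit is True independently with probability $p$. The short sketch below (function names are hypothetical) computes the exact critical probability and the acceptance probability at $2p_c$.

```python
import math

def or_true_probability(n, p):
    """Pr[OR_n(x) = True] when each of the n bits is True with probability p."""
    return 1.0 - (1.0 - p) ** n

def or_critical_probability(n):
    """The p_c solving 1 - (1 - p_c)^n = 1/2."""
    return 1.0 - 0.5 ** (1.0 / n)

if __name__ == "__main__":
    for n in (10, 1000, 10 ** 6):
        pc = or_critical_probability(n)
        print(f"n={n}: n*p_c = {n * pc:.6f} (ln 2 = {math.log(2):.6f}), "
              f"Pr at 2*p_c = {or_true_probability(n, 2 * pc):.6f}")
```

Doubling $p$ from $p_c$ only moves the probability from $1/2$ to roughly $3/4$, so the transition window has width comparable to $p_c$ itself rather than $o(p_c)$.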


10.4. More on Randomization/Symmetrization 


In Section 10.3 we collected a number of consequences of the General Hypercontractivity
Theorem for functions $f \in L^2(\Omega^n, \pi^{\otimes n})$. All of these had a dependence
on “$\lambda$”, the least probability of an outcome under $\pi$. This can sometimes
be quite expensive; for example, the KKL Theorem and its consequence Theorem 10.29
are trivialized when $\lambda = 1/n^{\Theta(1)}$.

However, as mentioned in Section 10.2, when working with symmetric
random variables $X$, the “randomization” trick sometimes lets you reduce
to the analysis of uniformly random ±1 bits (which have $\lambda = 1/2$). Further,
Lemma 10.15 suggests a way of “symmetrizing” general mean-zero random
variables (at least if we don’t mind applying $T_{1/2}$). In this section we will
develop the randomization/symmetrization technique more thoroughly and see
an application: bounding the $L^p \to L^p$ norm of the “low-degree projection”
operator.

Informally, applying the randomization/symmetrization technique to $f \in L^2(\Omega^n, \pi^{\otimes n})$
means introducing $n$ independent uniformly random bits $r = (r_1, \dots, r_n) \sim \{-1, 1\}^n$
and then “multiplying the $i$th input to $f$ by $r_i$”. Of
course $\Omega$ is just an abstract set so this doesn’t quite make sense. What we really
mean is “multiplying $L_i f$, the $i$th part of $f$’s Fourier expansion (orthogonal
decomposition), by $r_i$”. Let’s see some examples:


Example 10.30. Let $f : \{-1, 1\}^n \to \mathbb{R}$ be a usual Boolean function with
Fourier expansion

\[ f(x) = \sum_{S \subseteq [n]} \widehat{f}(S) \prod_{i \in S} x_i. \]

Its randomization/symmetrization will be the function

\[ \widetilde{f}(r, x) = \sum_{S \subseteq [n]} \widehat{f}(S) \prod_{i \in S} r_i x_i = \sum_{S \subseteq [n]} \widehat{f}(S)\, x^S r^S. \]

The key observation is that for random inputs $x, r \sim \{-1, 1\}^n$, the random
variables $f(x)$ and $\widetilde{f}(r, x)$ are identically distributed. This is simply because
$x_i$ is a symmetric random variable, so it has the same distribution as $r_i x_i$.
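For a Boolean function on the cube, the randomization/symmetrization is just $f$ evaluated at the coordinatewise product of $r$ and $x$, which is why the two distributions agree. The sketch below checks this by brute force for the 3-bit majority function; the helper names and the choice of majority as the test function are assumptions made only for this illustration.

```python
import itertools
import math
from collections import Counter

def subsets(n):
    """All subsets of {0, ..., n-1}, as tuples."""
    return [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]

def fourier_coefficients(f, n):
    """Fourier coefficients of f: {-1,1}^n -> R under the uniform distribution."""
    cube = list(itertools.product([-1, 1], repeat=n))
    return {S: sum(f(x) * math.prod(x[i] for i in S) for x in cube) / len(cube)
            for S in subsets(n)}

def randomized(coeffs, r, x):
    """Evaluate sum over S of fhat(S) * prod_{i in S} r_i x_i."""
    return sum(c * math.prod(r[i] * x[i] for i in S) for S, c in coeffs.items())

def majority3(x):
    return 1 if sum(x) > 0 else -1

if __name__ == "__main__":
    n = 3
    cube = list(itertools.product([-1, 1], repeat=n))
    coeffs = fourier_coefficients(majority3, n)
    dist_f = Counter(majority3(x) for x in cube)
    dist_rand = Counter(round(randomized(coeffs, r, x)) for r in cube for x in cube)
    # Both counters split evenly between +1 and -1 (out of 8 and 64 points).
    print(dist_f, dist_rand)
```

The same brute-force approach works for any small $n$, at a cost of roughly $8^n$ arithmetic operations.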


Example 10.31. Let’s return to Examples 8.10 and 8.15 from Chapter 8.1.
Here we had $\Omega = \{a, b, c\}$ with $\pi$ the uniform distribution, and we defined a
certain Fourier basis $\{\phi_0 \equiv 1, \phi_1, \phi_2\}$. A typical $f : \Omega^3 \to \mathbb{R}$ here might look
like
f(%1, x2, x3) 
= 4—1. pæ) + $- pax) + pa) E- paa) F- oa) 
+ § Giri) p3) + g pia) pi) 
= Gy PiCe1) - ba(x2)- p33) + § P1) - b2(x2) + d2(x3)- 


The randomization/symmetrization of this function would be the following
function $\widetilde{f} \in L^2(\{-1, 1\}^3 \times \Omega^3, \pi_{1/2}^{\otimes 3} \otimes \pi^{\otimes 3})$:


i ipi) ri + FG 2(K1) ri + Oi (K2) r + tolx) -r — FG2("3) r3 
+ §G1(%1) - 2x3) Firs + gb1(X2) - Giles) - r2r3 
= poi) + G2(x2) + G3(x3) - rirar3 + 42x1) - $2(¥2) + Oo(x3) + r1r2F3. 


There’s no obvious way to compare the distributions of $f(x)$ and $\widetilde{f}(r, x)$.
However, looking carefully at Example 8.10 we see that the basis function
$\phi_2$ has the property that $\phi_2(x_i)$ is a symmetric real random variable when
$x_i \sim \pi$. In particular, $r_i \cdot \phi_2(x_i)$ has the same distribution as $\phi_2(x_i)$. Therefore
if $g \in L^2(\Omega^n, \pi^{\otimes n})$ has the lucky property that its Fourier expansion happens
to only use $\phi_2$ and never uses $\phi_1$, then we do have that $g(x)$ and $\widetilde{g}(r, x)$ are
identically distributed.


Let’s give a formal definition of randomization/symmetrization. 


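Definition 10.32. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$. The randomization/symmetrization
of $f$ is the function $\widetilde{f} \in L^2(\{-1, 1\}^n \times \Omega^n, \pi_{1/2}^{\otimes n} \otimes \pi^{\otimes n})$ defined by

\[ \widetilde{f}(r, x) = \sum_{S \subseteq [n]} r^S f^{=S}(x), \tag{10.10} \]

where we recall the notation $r^S = \prod_{i \in S} r_i$.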


Remark 10.33. Another way of defining $\widetilde{f}$ is to stipulate that for each $x \in \Omega^n$,
the function $\widetilde{f}|_x : \{-1, 1\}^n \to \mathbb{R}$ is defined to be the Boolean function whose
Fourier coefficient on $S$ is $f^{=S}(x)$. (This is more evident from (10.10) if you
swap the positions of $r^S$ and $f^{=S}(x)$.)


In light of this remark, the basic Parseval formula for Boolean functions
implies that for all $x \in \Omega^n$,

\[ \bigl\|\widetilde{f}|_x\bigr\|_{2,r}^2 = \sum_{S \subseteq [n]} f^{=S}(x)^2. \]

(The notation $\|\cdot\|_{2,r}$ emphasizes that the norm is computed with respect to the
random inputs $r$.) If we take the expectation of the above over $x \sim \pi^{\otimes n}$, the
left-hand side becomes $\|\widetilde{f}\|_{2,r,x}^2$ and the right-hand side becomes $\|f\|_{2,x}^2$, by
Parseval’s formula for $L^2(\Omega^n, \pi^{\otimes n})$. Thus:


Proposition 10.34. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then $\|\widetilde{f}\|_2 = \|f\|_2$.


Thus randomization/symmetrization doesn’t change 2-norms. What about
$q$-norms for $q \ne 2$? As discussed in Examples 10.30 and 10.31, if $f$ has the
lucky property that its Fourier expansion only contains symmetric basis functions
then $\widetilde{f}(r, x)$ and $f(x)$ have identical distributions, so their $q$-norms are
identical. The essential feature of the randomization/symmetrization technique
is that even for general $f$ the $q$-norms don’t change much, provided you are willing
to apply $T_\rho$ for some constant $\rho$:


Theorem 10.35. For $f \in L^2(\Omega^n, \pi^{\otimes n})$ and $q > 1$,

\[ \|T_{1/2} f\|_q \le \|\widetilde{f}\|_q \le \|T_{c_q^{-1}} f\|_q. \tag{10.11} \]

Equivalently,

\[ \|T_{c_q} \widetilde{f}\|_q \le \|f\|_q \le \|T_2 \widetilde{f}\|_q. \]

Here $0 < c_q \le 1$ is a constant depending only on $q$; in particular, we may take
$c_4 = c_{4/3} = \frac25$.


The two inequalities in (10.11) are not too difficult to prove; for example, 
you might already correctly guess that the left-hand inequality follows from 
our first randomization/symmetrization Lemma 10.15 and an induction. We’ll 
give the proofs at the end of this section. But first, let’s illustrate how you 
might use them by solving the following basic problem concerning low-degree 
projections: 


Question 10.36. Let $k \in \mathbb{N}$, let $1 < q < \infty$, and let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Can
$\|f^{\le k}\|_q$ be much larger than $\|f\|_q$? To put the question in reverse, suppose
$g \in L^2(\Omega^n, \pi^{\otimes n})$ has degree at most $k$; is it possible to make the $q$-norm of $g$
much smaller by adding terms of degree exceeding $k$ to its Fourier expansion?


The question has a simple answer if $q = 2$: in this case we have $\|f^{\le k}\|_2 \le \|f\|_2$
always. This follows from Parseval:

\[ \|f^{\le k}\|_2^2 = \sum_{j=0}^{k} \mathbf{W}^{j}[f] \le \sum_{j=0}^{n} \mathbf{W}^{j}[f] = \|f\|_2^2. \tag{10.12} \]

When $q \ne 2$ things are not so simple, so let’s first consider the most familiar
setting of $\Omega = \{-1, 1\}$, $\pi = \pi_{1/2}$. In this case we can relate the $q$-norm and
the 2-norm via the Hypercontractivity Theorem:


Proposition 10.37. Let $k \in \mathbb{N}$ and let $g : \{-1, 1\}^n \to \mathbb{R}$. Then for $q \ge 2$
we have $\|g^{\le k}\|_q \le \sqrt{q-1}^{\,k} \|g\|_q$ and for $1 < q \le 2$ we have
$\|g^{\le k}\|_q \le (1/\sqrt{q-1})^k \|g\|_q$.


This proposition is an easy consequence of the Hypercontractivity Theorem
and already appeared as Exercise 9.8. The simplest case, $q = 4$, follows from
the Bonami Lemma alone:

\[ \|g^{\le k}\|_4 \le \sqrt{3}^{\,k} \|g^{\le k}\|_2 \le \sqrt{3}^{\,k} \|g\|_2 \le \sqrt{3}^{\,k} \|g\|_4. \tag{10.13} \]


Now let’s consider functions $f \in L^2(\Omega^n, \pi^{\otimes n})$ on general product spaces;
for simplicity, we’ll continue to focus on the case $q = 4$. One possibility is to
repeat the above proof using the General Hypercontractivity Theorem (more
specifically, Theorem 10.21). This would give us $\|f^{\le k}\|_4 \le (\sqrt{3} \cdot \lambda^{-1/4})^k \|f\|_4$. However,
we will see that it’s possible to get a bound completely independent of $\lambda$
(i.e., independent of $(\Omega, \pi)$) using randomization/symmetrization.

First, suppose we are in the lucky case described in Example 10.31 in which
$f$’s Fourier spectrum only uses symmetric basis functions. In this case $f^{\le k}(x)$
and $\widetilde{f^{\le k}}(r, x)$ have the same distribution for any $k$, and we can leverage the
$L^2(\{-1, 1\})$ bound (10.13) to get the same result for $f$. First,

\[ \|f^{\le k}\|_4 = \bigl\|\widetilde{f^{\le k}}\bigr\|_4 = \Bigl\| \bigl\|\widetilde{f^{\le k}}|_x(r)\bigr\|_{4,r} \Bigr\|_{4,x}. \]

For each outcome $x = x$, the inner function $g(r) = \widetilde{f^{\le k}}|_x(r)$ is a degree-$k$
function of $r \in \{-1, 1\}^n$. Therefore we can apply (10.13) with this $g$ to deduce

\[ \Bigl\| \bigl\|\widetilde{f^{\le k}}|_x(r)\bigr\|_{4,r} \Bigr\|_{4,x} \le \Bigl\| \sqrt{3}^{\,k} \bigl\|\widetilde{f}|_x(r)\bigr\|_{4,r} \Bigr\|_{4,x} = \sqrt{3}^{\,k} \|\widetilde{f}\|_4 = \sqrt{3}^{\,k} \|f\|_4. \]

Thus we see that we can deduce (10.13) “automatically” for these luckily
symmetric $f$, with no dependence on “$\lambda$”. We’ll now show that we
can get something similar for a completely general $f$ using the randomization/symmetrization
Theorem 10.35. This will cause us to lose a factor of
$(2 \cdot \frac52)^k$, due to application of $T_2$ and $T_{5/2}$; to prepare for this, we first extend the
calculation in (10.13) slightly.


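Lemma 10.38. Let $k \in \mathbb{N}$ and let $g : \{-1, 1\}^n \to \mathbb{R}$. Then for any $0 < \rho \le 1$,

\[ \|g^{\le k}\|_4 \le \sqrt{3}^{\,k} \|g^{\le k}\|_2 \le (\sqrt{3}/\rho)^k \|T_\rho g\|_2 \le (\sqrt{3}/\rho)^k \|T_\rho g\|_4. \]

Proof. In this chain the first inequality is Bonami’s Lemma, the third holds
because the 2-norm is at most the 4-norm, and the second is because

\[ \|g^{\le k}\|_2^2 = \sum_{j=0}^{k} \mathbf{W}^{j}[g] \le (1/\rho^2)^k \sum_{j=0}^{k} \rho^{2j}\, \mathbf{W}^{j}[g] \le (1/\rho^2)^k \sum_{j=0}^{n} \rho^{2j}\, \mathbf{W}^{j}[g] = (1/\rho^2)^k \|T_\rho g\|_2^2. \]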
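The chain of inequalities in Lemma 10.38 can be checked by brute force for small $n$. In the sketch below a function $g : \{-1, 1\}^n \to \mathbb{R}$ is represented by a dictionary of Fourier coefficients indexed by subsets; this representation and all helper names are assumptions made for the illustration only.

```python
import itertools
import math
import random

def subsets(n):
    """All subsets of {0, ..., n-1}, as tuples."""
    return [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]

def evaluate(coeffs, x):
    """g(x) = sum over S of ghat(S) * prod_{i in S} x_i."""
    return sum(c * math.prod(x[i] for i in S) for S, c in coeffs.items())

def norm(coeffs, n, q):
    """q-norm of g under the uniform distribution on {-1,1}^n."""
    cube = itertools.product([-1, 1], repeat=n)
    return (sum(abs(evaluate(coeffs, x)) ** q for x in cube) / 2 ** n) ** (1.0 / q)

def low_degree(coeffs, k):
    """Fourier coefficients of the projection g^{<=k}."""
    return {S: c for S, c in coeffs.items() if len(S) <= k}

def noise(coeffs, rho):
    """Fourier coefficients of T_rho g."""
    return {S: rho ** len(S) * c for S, c in coeffs.items()}

def lemma_chain(coeffs, n, k, rho):
    """The four quantities of Lemma 10.38, in the order they appear."""
    low, smoothed = low_degree(coeffs, k), noise(coeffs, rho)
    scale = (math.sqrt(3) / rho) ** k
    return [norm(low, n, 4), math.sqrt(3) ** k * norm(low, n, 2),
            scale * norm(smoothed, n, 2), scale * norm(smoothed, n, 4)]

if __name__ == "__main__":
    rng = random.Random(0)
    n = 4
    g = {S: rng.gauss(0.0, 1.0) for S in subsets(n)}
    for k in range(n + 1):
        print(k, [round(v, 4) for v in lemma_chain(g, n, k, rho=0.5)])
```

Each printed list should be nondecreasing from left to right; the cost is exponential in $n$, so this is only meant for very small examples.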


We can now give a good answer to Question 10.36, showing that low-degree
projection doesn’t substantially increase any $q$-norm:


Theorem 10.39. Let $k \in \mathbb{N}$ and let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then for $q > 1$ we have
$\|f^{\le k}\|_q \le C_q^k \|f\|_q$. Here $C_q$ is a constant depending only on $q$; in particular
we may take $C_4, C_{4/3} = 5\sqrt{3} \le 9$.


Proof. We will give the proof for $q = 4$; the other cases are left for Exercise 10.16.
Using the randomization/symmetrization Theorem 10.35,

\[ \|f^{\le k}\|_4 \le \bigl\|T_2 \widetilde{f^{\le k}}\bigr\|_4 = \Bigl\| \bigl\|T_2 \widetilde{f^{\le k}}|_x(r)\bigr\|_{4,r} \Bigr\|_{4,x}. \]

For a given outcome $x = x$, let’s write $g = T_2 \widetilde{f}|_x : \{-1, 1\}^n \to \mathbb{R}$, so that we
have $\|g^{\le k}(r)\|_{4,r}$ on the inside above. For clarity, we remark that $g$ is the Boolean
function whose Fourier coefficient on $S$ is $2^{|S|} f^{=S}(x)$. We apply Lemma 10.38
to this $g$, with $\rho = \frac15$. Note that $T_\rho g$ is then the Boolean function whose Fourier
coefficient on $S$ is $(\frac25)^{|S|} f^{=S}(x)$; i.e., it is $T_{2/5} \widetilde{f}|_x$. Thus we deduce

\[ \Bigl\| \bigl\|T_2 \widetilde{f^{\le k}}|_x(r)\bigr\|_{4,r} \Bigr\|_{4,x} \le \Bigl\| (5\sqrt{3})^k \bigl\|T_{2/5} \widetilde{f}|_x(r)\bigr\|_{4,r} \Bigr\|_{4,x} = (5\sqrt{3})^k \|T_{2/5} \widetilde{f}\|_4 \le (5\sqrt{3})^k \|f\|_4, \]

where the last step is the “un-randomization/symmetrization” inequality from
Theorem 10.35.


The remainder of this section is devoted to the proof of Theo- 
rem 10.35, which lets us compare norms of a function and its randomiza- 
tion/symmetrization. It will help to view randomization/symmetrization from 
an operator perspective. To do this, we need to slightly extend our $T_\rho$ notation,
allowing for “different noise rates on different coordinates”. 


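Definition 10.40. For $i \in [n]$ and $\rho \in \mathbb{R}$, let $T^i_\rho$ be the operator on $L^2(\Omega^n, \pi^{\otimes n})$
defined by

$$T^i_\rho f = \rho f + (1-\rho) E_i f = E_i f + \rho L_i f = \sum_{S \not\ni i} f^{=S} + \rho \sum_{S \ni i} f^{=S}. \qquad (10.14)$$

Furthermore, for $r = (r_1, \dots, r_n) \in \mathbb{R}^n$, let $T_r$ be the operator on $L^2(\Omega^n, \pi^{\otimes n})$
defined by $T_r = T^1_{r_1} T^2_{r_2} \cdots T^n_{r_n}$. From the third formula in (10.14) we have

$$T_r f = \sum_{S \subseteq [n]} r^S f^{=S}, \qquad (10.15)$$

where we use the notation $r^S = \prod_{i \in S} r_i$. In particular, $T_{(\rho, \dots, \rho)}$ is the usual $T_\rho$
operator. We remark that when $r \in [0,1]^n$ we have

$$T_r f(x) = \mathbf{E}_{\boldsymbol{y}_1 \sim N_{r_1}(x_1), \dots, \boldsymbol{y}_n \sim N_{r_n}(x_n)}[f(\boldsymbol{y}_1, \dots, \boldsymbol{y}_n)].$$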


These generalizations of the noise operator behave the way you would
expect; you are referred to Exercise 8.11 for some basic properties.
Now comparing (10.15) and (10.10) reveals the connection to randomiza- 
tion/symmetrization: 


Fact 10.41. For $f \in L^2(\Omega^n, \pi^{\otimes n})$, $x \in \Omega^n$, and $r \in \{-1,1\}^n$,
$$\widetilde{f}(r, x) = T_r f(x).$$


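In other words, randomization/symmetrization of $f$ means applying
$T_{(\pm 1, \pm 1, \dots, \pm 1)}$ to $f$ for a random choice of signs. We use this viewpoint to
prove Theorem 10.35, which we do in two steps:

Theorem 10.42. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then for any $q > 1$,

$$\|T_{\frac12} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \le \|T_{\boldsymbol{r}} f(\boldsymbol{x})\|_{q,\boldsymbol{r},\boldsymbol{x}} \qquad (10.16)$$

for $\boldsymbol{x} \sim \pi^{\otimes n}$, $\boldsymbol{r} \sim \{-1,1\}^n$. In other words, $\|T_{\frac12} f\|_q \le \|\widetilde{f}\|_q$.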


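Proof. In brief, the result follows from our first randomization/symmetrization
result, Lemma 10.15, and an induction. To fill in the details, we begin by showing
that if $h \in L^2(\Omega, \pi)$ is any one-input function and $\boldsymbol{\omega} \sim \pi$, $\boldsymbol{b} \sim \{-1,1\}$,
then

$$\|T_{\frac12} h(\boldsymbol{\omega})\|_{q,\boldsymbol{\omega}} \le \|T_{\boldsymbol{b}} h(\boldsymbol{\omega})\|_{q,\boldsymbol{b},\boldsymbol{\omega}}. \qquad (10.17)$$

This follows immediately from Lemma 10.15 because $h^{=\{1\}}(\boldsymbol{\omega})$ is a mean-zero
random variable (cf. the proof of Corollary 10.20). Next, we show that for any
$g \in L^2(\Omega^n, \pi^{\otimes n})$ and any $i \in [n]$,

$$\|T^i_{\frac12} g(\boldsymbol{x})\|_{q,\boldsymbol{x}} \le \|T^i_{\boldsymbol{b}} g(\boldsymbol{x})\|_{q,\boldsymbol{b},\boldsymbol{x}}. \qquad (10.18)$$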


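Assuming $i = 1$ for notational simplicity, and writing $x = (x_1, x')$ where $x' =
(x_2, \dots, x_n)$, we have

$$\|T^1_{\frac12} g(\boldsymbol{x})\|_{q,\boldsymbol{x}} = \Bigl\|\, \|T^1_{\frac12} g(\boldsymbol{x}_1, \boldsymbol{x}')\|_{q,\boldsymbol{x}_1} \Bigr\|_{q,\boldsymbol{x}'} = \Bigl\|\, \|T_{\frac12}\, g_{|\boldsymbol{x}'}(\boldsymbol{x}_1)\|_{q,\boldsymbol{x}_1} \Bigr\|_{q,\boldsymbol{x}'}.$$

(You are asked to carefully justify the second equality here in Exercise 10.10.)
Now for each outcome of $\boldsymbol{x}'$ we can apply (10.17) with $h = g_{|x'}$ to deduce

$$\Bigl\|\, \|T_{\frac12}\, g_{|\boldsymbol{x}'}(\boldsymbol{x}_1)\|_{q,\boldsymbol{x}_1} \Bigr\|_{q,\boldsymbol{x}'} \le \Bigl\|\, \|T_{\boldsymbol{b}}\, g_{|\boldsymbol{x}'}(\boldsymbol{x}_1)\|_{q,\boldsymbol{b},\boldsymbol{x}_1} \Bigr\|_{q,\boldsymbol{x}'} = \|T^1_{\boldsymbol{b}} g(\boldsymbol{x})\|_{q,\boldsymbol{b},\boldsymbol{x}}.$$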

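Finally, we illustrate the first step of the induction. For distinct indices $i$, $j$,

$$\|T^i_{\frac12} T^j_{\frac12} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \le \|T^i_{\boldsymbol{r}_i} T^j_{\frac12} f(\boldsymbol{x})\|_{q,\boldsymbol{r}_i,\boldsymbol{x}}$$

by applying (10.18) with $g = T^j_{\frac12} f$. Then

$$\|T^i_{\boldsymbol{r}_i} T^j_{\frac12} f(\boldsymbol{x})\|_{q,\boldsymbol{r}_i,\boldsymbol{x}} = \Bigl\|\, \|T^i_{\boldsymbol{r}_i} T^j_{\frac12} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \Bigr\|_{q,\boldsymbol{r}_i} = \Bigl\|\, \|T^j_{\frac12} T^i_{\boldsymbol{r}_i} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \Bigr\|_{q,\boldsymbol{r}_i},$$

where we used that $T^i_{r_i}$ and $T^j_{\frac12}$ commute. Now for each outcome of $\boldsymbol{r}_i$ we can
apply (10.18) with $g = T^i_{r_i} f$ to get

$$\Bigl\|\, \|T^j_{\frac12} T^i_{\boldsymbol{r}_i} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \Bigr\|_{q,\boldsymbol{r}_i} \le \Bigl\|\, \|T^j_{\boldsymbol{r}_j} T^i_{\boldsymbol{r}_i} f(\boldsymbol{x})\|_{q,\boldsymbol{r}_j,\boldsymbol{x}} \Bigr\|_{q,\boldsymbol{r}_i} = \|T^i_{\boldsymbol{r}_i} T^j_{\boldsymbol{r}_j} f(\boldsymbol{x})\|_{q,\boldsymbol{r}_i,\boldsymbol{r}_j,\boldsymbol{x}}.$$

Thus we have shown

$$\|T^i_{\frac12} T^j_{\frac12} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \le \|T^i_{\boldsymbol{r}_i} T^j_{\boldsymbol{r}_j} f(\boldsymbol{x})\|_{q,\boldsymbol{r}_i,\boldsymbol{r}_j,\boldsymbol{x}}.$$

Continuing the induction in the same way completes the proof.

To prove the “un-randomization/symmetrization” inequality in Theorem 10.35,
we first establish an elementary lemma about mean-zero random
variables: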


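Lemma 10.43. Let $q \ge 2$. Then there is a small enough $0 < c_q \le 1$ such that

$$\|a - c_q \boldsymbol{X}\|_q \le \|a + \boldsymbol{X}\|_q$$

for any $a \in \mathbb{R}$ and any random variable $\boldsymbol{X}$ satisfying $\mathbf{E}[\boldsymbol{X}] = 0$ and $\|\boldsymbol{X}\|_q < \infty$.
In particular we may take $c_4 = \frac25$.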


Proof. We will only prove the statement for $q = 4$; you are asked to establish
the general case in Exercise 10.13. By homogeneity we may assume $a = 1$;
then raising the inequality to the 4th power we need to show

$$\mathbf{E}[(1 - c\boldsymbol{X})^4] \le \mathbf{E}[(1 + \boldsymbol{X})^4]$$

for small enough $c$. Expanding both sides and using $\mathbf{E}[\boldsymbol{X}] = 0$, this is equivalent
to

$$\mathbf{E}[(1 - c^4)\boldsymbol{X}^4 + (4 + 4c^3)\boldsymbol{X}^3 + (6 - 6c^2)\boldsymbol{X}^2] \ge 0. \qquad (10.19)$$

It suffices to find $c$ such that

$$(1 - c^4)x^2 + (4 + 4c^3)x + (6 - 6c^2) \ge 0 \quad \forall x \in \mathbb{R}; \qquad (10.20)$$

then we can multiply this inequality by $x^2$ and take expectations to
obtain (10.19). This last problem is elementary, and Exercise 10.14 asks you
to find the largest $c$ that works (the answer is $c \approx .435$). To see that $c = \frac25$
suffices, we use the fact that $x \ge -\frac29 x^2 - \frac98$ for all $x$ (because the difference
of the left- and right-hand sides is $\frac{1}{72}(4x + 9)^2$). Putting this into (10.20), it
remains to ensure

$$\bigl(\tfrac19 - \tfrac89 c^3 - c^4\bigr)x^2 + \bigl(\tfrac32 - 6c^2 - \tfrac92 c^3\bigr) \ge 0 \quad \forall x \in \mathbb{R},$$

and when $c = \frac25$ this is the trivially true statement $\frac{161}{5625}x^2 + \frac{63}{250} \ge 0$.
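The arithmetic in this proof is easy to confirm mechanically. The sketch below is an illustration with hypothetical helper names (`quadratic`, `holds_for_all_x`, `largest_c`): it evaluates the two coefficients left after the substitution at $c = \frac25$ using exact rationals, checks (10.20) through the discriminant of the quadratic, and bisects for the largest admissible $c$. The bisection assumes that validity of (10.20) switches from true to false only once on $(0,1)$.

```python
from fractions import Fraction

def quadratic(c):
    """Coefficients (A, B, C) of (1-c^4)x^2 + (4+4c^3)x + (6-6c^2) from (10.20)."""
    return 1 - c**4, 4 + 4 * c**3, 6 - 6 * c**2

def holds_for_all_x(c):
    """A quadratic with A > 0 is nonnegative everywhere iff its discriminant is <= 0."""
    A, B, C = quadratic(c)
    return A > 0 and B * B - 4 * A * C <= 0

def largest_c(lo=0.0, hi=1.0, steps=60):
    """Bisect for the largest c in (lo, hi) for which (10.20) holds."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if holds_for_all_x(mid):
            lo = mid
        else:
            hi = mid
    return lo

c = Fraction(2, 5)
# Coefficients remaining after substituting x >= -(2/9)x^2 - 9/8 into (10.20):
coef_x2 = Fraction(1, 9) - Fraction(8, 9) * c**3 - c**4
coef_const = Fraction(3, 2) - 6 * c**2 - Fraction(9, 2) * c**3
print(coef_x2, coef_const)     # 161/5625 63/250
print(holds_for_all_x(c))      # True
print(round(largest_c(), 3))   # 0.435
```

The bisection result agrees with the value $c \approx .435$ quoted from Exercise 10.14.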


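Theorem 10.44. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$. Then for any $q > 1$,

$$\|T_{c_q \boldsymbol{r}} f(\boldsymbol{x})\|_{q,\boldsymbol{r},\boldsymbol{x}} \le \|f(\boldsymbol{x})\|_{q,\boldsymbol{x}}$$

for $\boldsymbol{x} \sim \pi^{\otimes n}$, $\boldsymbol{r} \sim \{-1,1\}^n$. In other words, $\|\widetilde{T_{c_q} f}\|_q \le \|f\|_q$. Here $0 < c_q \le 1$
is a constant depending only on $q$; in particular we may take $c_4, c_{4/3} = \frac25$.


Proof. In fact, we can show that for every outcome $\boldsymbol{r} = r \in \{-1,1\}^n$ we have

$$\|T_{c_q r} f(\boldsymbol{x})\|_{q,\boldsymbol{x}} \le \|f(\boldsymbol{x})\|_{q,\boldsymbol{x}}$$

for sufficiently small $c_q > 0$. Note that on the left-hand side we have

$$\|T^1_{c_q r_1} T^2_{c_q r_2} \cdots T^n_{c_q r_n} f(\boldsymbol{x})\|_{q,\boldsymbol{x}}.$$

We know that $T^i_\rho$ is a contraction in $L^q$ for any $0 \le \rho \le 1$ (Exercise 8.11). Hence
it suffices to show that $T^i_{-c_q}$ is a contraction in $L^q$, i.e., that

$$\|T^i_{-c_q} g(\boldsymbol{x})\|_{q,\boldsymbol{x}} \le \|g(\boldsymbol{x})\|_{q,\boldsymbol{x}} \qquad (10.21)$$

for all $g \in L^2(\Omega^n, \pi^{\otimes n})$. Similar to the proof of Theorem 10.42, it suffices to
show

$$\|T_{-c_q} h\|_q \le \|h\|_q \qquad (10.22)$$

for all one-input functions $h \in L^2(\Omega, \pi)$, because then (10.21) holds pointwise
for all outcomes of $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n$. By Proposition 9.19, if we
prove (10.22) for some $q$, then the same constant $c_q$ works for the conjugate
Hölder index $q'$; thus we may restrict attention to $q \ge 2$. Now the result follows
from Lemma 10.43 by taking $a = h^{=\emptyset}$ and $\boldsymbol{X} = h^{=\{1\}}(\boldsymbol{x})$.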
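Both halves of the randomization/symmetrization comparison can be tested numerically on a tiny product space. The following Python sketch is a brute-force illustration under stated assumptions: the space $\Omega = \{0,1,2\}$ with $\pi = (0.7, 0.2, 0.1)$, $n = 3$, the random test function, and all helper names are hypothetical choices for the example. It computes the orthogonal decomposition by inclusion-exclusion over the averaged functions $f^{\subseteq T}$, applies $T_r$ through formula (10.15), and compares $\|T_{\frac12} f\|_4$ with $\|\widetilde{f}\|_4$ (Theorem 10.42) and $\|\widetilde{T_{\frac25} f}\|_4$ with $\|f\|_4$ (Theorem 10.44).

```python
import itertools
import random

def subsets(n):
    return [frozenset(s) for r in range(n + 1)
            for s in itertools.combinations(range(n), r)]

def weight(pi, x):
    """Product-measure probability of the point x."""
    w = 1.0
    for xi in x:
        w *= pi[xi]
    return w

def orthogonal_decomposition(f, pi, n):
    """Return {S: f^{=S}}, each f^{=S} a dict from points of Omega^n to floats."""
    points = list(itertools.product(range(len(pi)), repeat=n))
    # proj[T](x) = f^{subseteq T}(x): average f over the coordinates outside T.
    proj = {}
    for T in subsets(n):
        proj[T] = {
            x: sum(weight(pi, y)
                   * f[tuple(x[i] if i in T else y[i] for i in range(n))]
                   for y in points)
            for x in points
        }
    # Inclusion-exclusion: f^{=S} = sum_{T subseteq S} (-1)^{|S|-|T|} f^{subseteq T}.
    return {
        S: {x: sum((-1) ** (len(S) - len(T)) * proj[T][x]
                   for T in subsets(n) if T <= S)
            for x in points}
        for S in subsets(n)
    }

def apply_T(parts, r):
    """T_r f = sum_S (prod_{i in S} r_i) f^{=S}, as in (10.15)."""
    out = {}
    for S, part in parts.items():
        coeff = 1.0
        for i in S:
            coeff *= r[i]
        for x, v in part.items():
            out[x] = out.get(x, 0.0) + coeff * v
    return out

def norm(h, pi, q):
    return sum(weight(pi, x) * abs(v) ** q for x, v in h.items()) ** (1 / q)

def randomized_norm(parts, pi, n, q, scale):
    """|| T_{scale*r} f(x) ||_{q,r,x} for r uniform on {-1,1}^n."""
    signs = list(itertools.product((-1, 1), repeat=n))
    total = sum(norm(apply_T(parts, [scale * s for s in r]), pi, q) ** q
                for r in signs)
    return (total / len(signs)) ** (1 / q)

random.seed(1)
n, pi = 3, [0.7, 0.2, 0.1]
f = {x: random.uniform(-1, 1) for x in itertools.product(range(3), repeat=n)}
parts = orthogonal_decomposition(f, pi, n)

lhs_42 = norm(apply_T(parts, [0.5] * n), pi, 4)   # ||T_{1/2} f||_4
rhs_42 = randomized_norm(parts, pi, n, 4, 1.0)    # ||f~||_4
lhs_44 = randomized_norm(parts, pi, n, 4, 0.4)    # ||(T_{2/5} f)~||_4
rhs_44 = norm(f, pi, 4)                           # ||f||_4
print(lhs_42 <= rhs_42, lhs_44 <= rhs_44)
```

The decomposition is recomputed from scratch rather than cached, which keeps the code short but limits it to very small $n$ and $|\Omega|$.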


10.5. Highlight: General Sharp Threshold Theorems 


In Chapter 8.4 we described the problem of “threshold phenomena” for monotone
functions $f : \{-1,1\}^n \to \{-1,1\}$. As $p$ increases from 0 to 1, we are
interested in whether $\Pr_{\boldsymbol{x} \sim \pi_p^{\otimes n}}[f(\boldsymbol{x}) = -1]$ has a “sharp threshold”, jumping
quickly from near 0 to near 1 around the critical probability $p = p_c$. The “sharp
threshold principle” tells us that this occurs (roughly speaking) if and only if the
total influence of $f$ under its critical distribution, $\mathbf{I}[f^{(p_c)}]$, is $\omega(1)$. (See
Exercise 8.28 for more precise statements.) This motivates finding a characterization
of functions with small total influence. Indeed, finding such a characterization is
a perfectly natural question even for not-necessarily-monotone Boolean-valued
functions $f \in L^2(\Omega^n, \pi^{\otimes n})$.

For the usual uniform distribution on $\{-1,1\}^n$, Friedgut’s Junta Theorem
from Chapter 9.6 provides a very good characterization: $f : \{-1,1\}^n \to
\{-1,1\}$ can only have $O(1)$ total influence if it’s (close to) an $O(1)$-junta. By the
version of Friedgut’s Junta Theorem for general product spaces (Section 10.3),
the same holds for Boolean-valued $f \in L^2(\{-1,1\}^n, \pi_p^{\otimes n})$ so long as $p$ is
not too close to 0 or to 1. However, for $p$ as small as $1/n^{\Theta(1)}$, the “junta”-size
promised by Friedgut’s Junta Theorem may be larger than $n$. (Cf. the breakdown
of Friedgut and Kalai’s sharp threshold result Theorem 10.29 for $p \le 1/n^{\Theta(1)}$.)
This is a shame, as many natural graph properties for which we’d like to show a
sharp threshold, e.g., (non-)3-colorability, have $p_c = 1/n^{\Theta(1)}$. At a technical
level, the reason for the breakdown for very small $p$ is the dependence on the
“$\lambda$” parameter in the General Hypercontractivity Theorem. But there’s a more
fundamental reason for its failure, as suggested by the example at the end of
Section 10.3: Friedgut’s Junta Theorem simply isn’t true for such small $p$. Let’s
give some examples:


Example 10.45.

The logical OR function $\mathrm{OR}_n : \{-1,1\}^n \to \{-1,1\}$ has critical probability
$p_c \sim \frac{\ln 2}{n}$, and its total influence at this probability is $\mathbf{I}[\mathrm{OR}_n^{(p_c)}] \sim 2\ln 2$,
a small constant. Yet it’s easy to see that under the $p_c$-biased distribution,
$\mathrm{OR}_n$ is not even, say, .1-close to any junta on $o(n)$ coordinates. (That is, for
every $o(n)$-junta $h$, $\Pr_{\boldsymbol{x} \sim \pi_{p_c}^{\otimes n}}[f(\boldsymbol{x}) \ne h(\boldsymbol{x})] > .1$.)

As another example, consider the function $f : \{-1,1\}^n \to \{-1,1\}$ that is
True ($-1$) if and only if there exists a “run” of three consecutive $-1$’s in
its input. (We allow runs to “wrap around”, thus making $f$ a transitive-symmetric
function.) It’s not hard to show that the critical probability for
this $f$ satisfies $p_c = \Theta(1/n^{1/3})$. Furthermore, since $f$ is computable by
a DNF of width 3, Exercise 8.26(b) shows that $\mathbf{I}[f^{(p_c)}] \le 12$, a small constant.
But again, this $f$ is not close to any $o(n)$-junta under the $p_c$-biased
distribution. A similar example is $\mathrm{Clique}_3 : \{\text{True}, \text{False}\}^{\binom{v}{2}} \to \{\text{True}, \text{False}\}$,
the graph property of containing a triangle.
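The numbers quoted for $\mathrm{OR}_n$ can be reproduced directly. The sketch below is illustrative and rests on two labeled assumptions: bits are encoded as 1 for True and 0 for False with each bit True with probability $p$, and total influence is computed as $\sum_i \mathbf{E}_{\boldsymbol{x}}[\mathbf{Var}_{\boldsymbol{x}_i} f]$, the usual product-space formula. The function names are hypothetical. The closed form $4np(1-p)^n$ follows because coordinate $i$ matters only when all other bits are False.

```python
import itertools
import math

def or_fn(x):
    """OR with output -1 (True) iff some input bit is 1 (True)."""
    return -1 if any(x) else 1

def total_influence(f, n, p):
    """I[f] = sum_i E_x[Var_{x_i} f] under the p-biased distribution (brute force)."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        w = math.prod(p if b else 1 - p for b in x)
        for i in range(n):
            x1 = x[:i] + (1,) + x[i + 1:]
            x0 = x[:i] + (0,) + x[i + 1:]
            # Variance of a variable equal to f(x1) w.p. p and f(x0) w.p. 1-p.
            total += w * p * (1 - p) * (f(x1) - f(x0)) ** 2
    return total

def or_critical_probability(n):
    """Solve (1 - p)^n = 1/2, i.e. Pr[OR_n = True] = 1/2."""
    return 1 - 2 ** (-1 / n)

def or_total_influence(n, p):
    """Closed form 4 n p (1-p)^n for OR_n."""
    return 4 * n * p * (1 - p) ** n

n = 6
p_c = or_critical_probability(n)
print(total_influence(or_fn, n, p_c), or_total_influence(n, p_c))
big = 1000
print(big * or_critical_probability(big), math.log(2))          # n * p_c vs ln 2
print(or_total_influence(big, or_critical_probability(big)), 2 * math.log(2))
```

For large $n$ the last two lines show $n p_c$ approaching $\ln 2$ and the total influence approaching $2\ln 2$, matching the asymptotics stated in the example.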


We see from these examples that for p very small, we can’t hope to show 
that low-influence functions are close to juntas. However, these counterexample 
functions still have low complexity in a weaker sense — they are computable by 
narrow DNFs. Indeed, Friedgut (Friedgut, 1999) suggests this as a characteri- 
zation: 


Friedgut’s Conjecture. There is a function $w : \mathbb{R}^+ \times (0,1) \to \mathbb{R}^+$ such that
the following holds: If $f : \{\text{True}, \text{False}\}^n \to \{\text{True}, \text{False}\}$ is a monotone function,
$0 < p \le 1/2$, and $\mathbf{I}[f^{(p)}] \le K$, then $f$ is $\epsilon$-close under $\pi_p^{\otimes n}$ to a monotone
DNF of width at most $w(K, \epsilon)$.


The assumption of monotonicity is essential in this conjecture; see
Exercise 10.38.
Short of proving his conjecture, Friedgut managed to show: 


Friedgut’s Sharp Threshold Theorem. The above conjecture holds when f 
is a graph property.

This gives a very good characterization of monotone graph properties with
low total influence, one that works no matter how small $p$ is. Friedgut also
extended his result to monotone hypergraph properties; this was sufficient for
him to show that several interesting hypergraph (or hypergraph-like) properties
have sharp thresholds: for example, the property of a random 3-uniform
hypergraph containing a perfect matching, or the property of a random width-3
DNF formula being a tautology. (Interestingly, for neither of these properties do
we know precisely where the critical probability $p_c$ is; nevertheless, we know
there is a sharp threshold around it.) Roughly speaking one needs to show that at
the critical probability, these properties can’t be well-approximated by narrow 
DNFs because they are almost surely not determined just by “local” information 
about the (hyper)graph. This kind of deduction takes some effort in random 
graph theory and we won’t discuss it further here beyond Exercise 10.42; for a 
survey, see Friedgut (Friedgut, 2005). 

Friedgut’s proof is rather long and it relies heavily on the function being a 
graph or hypergraph property. Following Friedgut’s work, Bourgain (Bourgain, 
1999) gave a shorter proof of an alternative characterization. Bourgain’s charac- 
terization is not as strong as Friedgut’s for monotone graph properties; however, 
it has the advantage that it works for low-influence functions on any product 
probability space. (In particular, there is no monotonicity assumption since the
domain need not be $\{\text{True}, \text{False}\}^n$.) We first make a quick definition and then
state Bourgain’s theorem. 


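Definition 10.46. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be $\{-1,1\}$-valued. For $T \subseteq [n]$,
$y \in \Omega^T$, and $\tau > 0$, we say that the restriction $y_T$ is a $\tau$-booster if $f^{\subseteq T}(y) \ge
\mathbf{E}[f] + \tau$. (Recall that $f^{\subseteq T}(y) = \mathbf{E}[f_{\overline{T} \mid y}]$.) In case $\tau < 0$ we say that $y_T$ is a
$\tau$-booster if $f^{\subseteq T}(y) \le \mathbf{E}[f] - |\tau|$.


Bourgain’s Sharp Threshold Theorem. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be $\{-1,1\}$-valued
with $\mathbf{I}[f] \le K$. Assume $\mathbf{Var}[f] \ge .01$. Then there is some $\tau$ (either
positive or negative) with $|\tau| \ge \exp(-O(K^2))$ such that

$$\Pr_{\boldsymbol{x} \sim \pi^{\otimes n}}\bigl[\exists\, T \subseteq [n],\ |T| \le O(K) \text{ such that } \boldsymbol{x}_T \text{ is a } \tau\text{-booster}\bigr] \ge |\tau|.$$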


Thinking of $K$ as an absolute constant, the theorem says that for a typical
input string $x$, there is a large chance that it contains a constant-sized
substring that is an $\Omega(1)$-booster for $f$. In the particular case of monotone
$f \in L^2(\{\text{True}, \text{False}\}^n, \pi_p^{\otimes n})$ with $p$ small, it’s not hard to deduce
(Exercise 10.40) that in fact there exists a $T$ with $|T| \le O(K)$ such that restricting all
coordinates in $T$ to be True increases $\Pr_{\pi_p^{\otimes n}}[f = \text{True}]$ by $\exp(-O(K^2))$. This
is a qualitatively weaker conclusion than what you get from Friedgut’s Sharp
Threshold Theorem when $f$ is a graph property with $\mathbf{I}[f] \le O(1)$; in that
case, by taking $T$ to be any of the width-$O(1)$ terms in the approximating DNF
one can increase $\Pr_{\pi_p^{\otimes n}}[f = \text{True}]$ not just by $\Omega(1)$ but up to almost 1.
Nevertheless, Bourgain’s theorem apparently suffices to deduce any of the sharp
threshold results obtainable from Friedgut’s theorem (Friedgut, 2005). For a
very high-level sketch of how Bourgain’s theorem would apply in the case of
3-colorability of random graphs, see Exercise 10.42.
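To make the notion of a booster concrete, here is a small brute-force sketch for $\mathrm{OR}_n$ at its critical probability. It is an illustration under stated assumptions: bits are encoded as 1 for True and 0 for False, $n = 6$ is an arbitrary small size, and the helper names are hypothetical. Since $\mathbf{E}[\mathrm{OR}_n] = 0$ at $p_c$ and fixing any True coordinate forces the value $-1$, every input containing a True bit has a single-coordinate $(-\frac12)$-booster.

```python
import itertools
import math

def or_fn(x):
    """OR with output -1 (True) iff some input bit is 1 (True)."""
    return -1 if any(x) else 1

def mean(f, n, p, fixed=None):
    """E[f] under the p-biased distribution, with coordinates in `fixed` pinned."""
    fixed = fixed or {}
    free = [i for i in range(n) if i not in fixed]
    total = 0.0
    for bits in itertools.product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, b in fixed.items():
            x[i] = b
        for i, b in zip(free, bits):
            x[i] = b
        total += math.prod(p if b else 1 - p for b in bits) * f(tuple(x))
    return total

def is_booster(f, n, p, x, T, tau):
    """Definition 10.46: compare E[f | x_T] with E[f] +/- |tau|."""
    restricted = mean(f, n, p, {i: x[i] for i in T})
    base = mean(f, n, p)
    if tau > 0:
        return restricted >= base + tau
    return restricted <= base - abs(tau)

def booster_probability(f, n, p, size, tau):
    """Pr_x[some T with |T| <= size makes x_T a tau-booster]."""
    prob = 0.0
    for x in itertools.product((0, 1), repeat=n):
        found = any(is_booster(f, n, p, x, T, tau)
                    for s in range(size + 1)
                    for T in itertools.combinations(range(n), s))
        if found:
            prob += math.prod(p if b else 1 - p for b in x)
    return prob

n = 6
p_c = 1 - 2 ** (-1 / n)
print(booster_probability(or_fn, n, p_c, size=1, tau=-0.5))
```

The printed probability is the chance that the input has at least one True bit, which is $\frac12$ at the critical probability up to floating-point error.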

The last part of this section will be devoted to proving Bourgain’s 
Sharp Threshold Theorem. Before doing this, we add one more remark. 
Hatami (Hatami, 2012) has significantly generalized Bourgain’s work, estab- 
lishing the following characterization of Boolean-valued functions with low 
total influence: 


Hatami’s Theorem. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ be a $\{-1,1\}$-valued function with
$\mathbf{I}[f] \le K$. Then for every $\epsilon > 0$, the function $f$ is $\epsilon$-close (under $\pi^{\otimes n}$) to an
$\exp(O(K^3/\epsilon^3))$-“pseudo-junta” $h : \Omega^n \to \{-1,1\}$.


The term “pseudo-junta” is defined in Exercise 10.39. A $K$-pseudo-junta $h$
has the property that $\mathbf{I}[h] \le 4K$; thus Hatami’s Theorem shows that having
$O(1)$ total influence is essentially equivalent to being an $O(1)$-pseudo-junta.
A downside of the result, however, is that being a $K$-pseudo-junta is not a
“syntactic” property; it depends on the probability distribution $\pi^{\otimes n}$.


Let’s now turn to proving Bourgain’s Sharp Threshold Theorem. In fact, 
Bourgain proved the theorem as a corollary of the following main result: 


Theorem 10.47. Let $(\Omega, \pi)$ be a finite probability space and let $f : \Omega^n \to
\{-1,1\}$. Let $0 < \epsilon < 1/2$ and write $k = \mathbf{I}[f]/\epsilon$. Then for each $x \in \Omega^n$ it’s
possible to define a set of “notable coordinates” $J_x \subseteq [n]$ satisfying $|J_x| \le
\exp(O(k))$ such that

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\sum_{S \notin \mathcal{F}_{\boldsymbol{x}}} f^{=S}(\boldsymbol{x})^2\Bigr] \le 2\epsilon.$$

Here $\mathcal{F}_x = \{S : S \subseteq J_x, |S| \le k\}$, a collection always satisfying $|\mathcal{F}_x| \le
\exp(O(k^2))$.


You may notice that this theorem looks extremely similar to Friedgut’s
Junta Theorem from Chapter 9.6 (and the $\exp(-O(\mathbf{I}[f]^2))$ quantity in Bourgain’s
Sharp Threshold Theorem looks similar to the Fourier coefficient
lower bound in Corollary 9.32). Indeed, the only difference between Theorem 10.47
and Friedgut’s Junta Theorem is that in the latter, the “notable
coordinates” $J$ can be “named in advance”: they’re simply the coordinates $j$
with $\mathbf{Inf}_j[f] = \sum_{S \ni j} \|f^{=S}\|_2^2$ large. By contrast, in Theorem 10.47 the notable
coordinates depend on the input $x$. As we will see in the proof, they are precisely
the coordinates $j$ such that $\sum_{S \ni j} f^{=S}(x)^2$ is large. Of course, in the setting of
$f : \{-1,1\}^n \to \{-1,1\}$ we have $f^{=S}(x)^2 = \widehat{f}(S)^2$ for all $x$, so the two definitions
coincide. But in the general setting of $f \in L^2(\Omega^n, \pi^{\otimes n})$ it makes sense
that we can’t name the notable coordinates in advance and rather have to “wait
until $x$ is chosen”. For example, for the $\mathrm{OR}_n$ function as in Example 10.45,
there are no notable coordinates to be named in advance, but once $x$ is chosen
the few coordinates on which $x$ takes the value True (if any exist) will be the
notable ones.

The proof of Theorem 10.47 mainly consists of adding the randomization/symmetrization
technique to the proof of Friedgut’s Junta Theorem (more
precisely, Theorem 9.28) to avoid dependence on the minimum probability
of $\pi$. This randomization/symmetrization is applied to what are essentially the
key inequalities in that proof:

$$\|T_{\frac{1}{\sqrt{3}}} L_i f\|_2^2 \le \|L_i f\|_{4/3}^2 = \|L_i f\|_{4/3}^{2/3} \cdot \|L_i f\|_{4/3}^{4/3} \le \|L_i f\|_{4/3}^{2/3} \cdot \mathbf{Inf}_i[f].$$

(The last inequality here is Exercise 8.10(b).) The overall proof needs one
more minor twist: since we work on a “per-$x$” basis and not in expectation,
it’s possible that the set of notable coordinates can be improbably large. (Think
again about the example of $\mathrm{OR}_n$; for $\boldsymbol{x} \sim \pi_{1/n}^{\otimes n}$ we expect only a constant
number of coordinates of $\boldsymbol{x}$ to be True, but it’s not always uniformly bounded.)
This is combated using the principle that low-degree functions are “reasonable”
(together with randomization/symmetrization).


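Proof of Theorem 10.47. By the simple “Markov argument” (see Proposition 3.2)
we have

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\sum_{|S| > k} f^{=S}(\boldsymbol{x})^2\Bigr] = \sum_{|S| > k} \|f^{=S}\|_2^2 \le \frac{\mathbf{I}[f]}{k} = \epsilon.$$

Thus it suffices to define the sets $J_x$ so that

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\sum_{|S| \le k,\ S \not\subseteq J_{\boldsymbol{x}}} f^{=S}(\boldsymbol{x})^2\Bigr] \le \epsilon. \qquad (10.23)$$

We’ll first define “notable coordinate” sets $J'_x \subseteq [n]$ which almost do the trick:

$$J'_x = \Bigl\{ j \in [n] : \sum_{S \ni j} f^{=S}(x)^2 \ge \tau \Bigr\}, \qquad \tau = c^{-k}$$

(where $c > 1$ is a universal constant). Using this definition, the main effort of
the proof will be to show

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\sum_{|S| \le k,\ S \not\subseteq J'_{\boldsymbol{x}}} f^{=S}(\boldsymbol{x})^2\Bigr] \le \epsilon/2. \qquad (10.24)$$

This looks better than (10.23); the only problem is that the sets $J'_x$ don’t always
satisfy $|J'_x| \le \exp(O(k))$ as needed. However, “in expectation” $|J'_{\boldsymbol{x}}|$ ought not
be much larger than $1/\tau = c^k$. Thus we introduce the event

$$\text{“} J'_x \text{ is too big”} \iff |J'_x| \ge C^k$$

(where $C > c$ is another universal constant) and define

$$J_x = \begin{cases} J'_x & \text{if } J'_x \text{ is not too big,} \\ \emptyset & \text{if } J'_x \text{ is too big.} \end{cases}$$

The last part of the proof will be to show that

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\mathbf{1}[J'_{\boldsymbol{x}} \text{ is too big}] \cdot \sum_{0 < |S| \le k} f^{=S}(\boldsymbol{x})^2\Bigr] \le \epsilon/2. \qquad (10.25)$$

Together, (10.25) and (10.24) establish (10.23). We will first prove (10.24) and
then prove (10.25). As a small aside, we’ll see that for both inequalities we
could obtain a bound much less than $\epsilon/2$ if desired.

To prove (10.24), we mimic the proof of Theorem 9.28 but add in randomization/symmetrization.
The key step is encapsulated in the following lemma.
Note that the lemma also holds with the more natural definition $g = L_i f$; the
additional $T_{\frac25}$ is to facilitate future “un-randomization/symmetrization”.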


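Lemma 10.48. Fix $x \in \Omega^n$ and $i \notin J'_x$. Then writing $g = T_{\frac25} L_i f$ we have

$$\|T_{\frac{1}{\sqrt{3}}} \widetilde{g}_{|x}\|_2^2 \le \tau^{1/3} \cdot \|\widetilde{g}_{|x}\|_{4/3}^{4/3}.$$

Proof. Here $\widetilde{g}$ is the randomization/symmetrization of $g$, so $\widetilde{g}_{|x} = \widetilde{g}_{|x}(r)$ is a
function on the uniform-distribution hypercube. Applying the basic $(4/3, 2)$-Hypercontractivity
Theorem we have

$$\|T_{\frac{1}{\sqrt{3}}} \widetilde{g}_{|x}\|_2^2 \le \|\widetilde{g}_{|x}\|_{4/3}^2 = \bigl(\|\widetilde{g}_{|x}\|_{4/3}^2\bigr)^{1/3} \cdot \|\widetilde{g}_{|x}\|_{4/3}^{4/3} \le \bigl(\|\widetilde{g}_{|x}\|_2^2\bigr)^{1/3} \cdot \|\widetilde{g}_{|x}\|_{4/3}^{4/3}.$$

But by the usual Parseval Theorem,

$$\|\widetilde{g}_{|x}\|_2^2 = \sum_{S \subseteq [n]} \widehat{\widetilde{g}_{|x}}(S)^2 = \sum_{S \ni i} (\tfrac25)^{2|S|} f^{=S}(x)^2 \le \sum_{S \ni i} f^{=S}(x)^2 < \tau,$$

the last inequality due to the assumption that $i \notin J'_x$.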


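We now establish (10.24):

$$\begin{aligned}
\mathbf{E}_{\boldsymbol{x}}\Bigl[\sum_{|S| \le k,\ S \not\subseteq J'_{\boldsymbol{x}}} f^{=S}(\boldsymbol{x})^2\Bigr]
&\le (5\sqrt{3}/2)^{2k} \cdot \mathbf{E}_{\boldsymbol{x}}\Bigl[\sum_{S \not\subseteq J'_{\boldsymbol{x}}} (\tfrac{2}{5\sqrt{3}})^{2|S|} f^{=S}(\boldsymbol{x})^2\Bigr] \\
&\le 20^k \cdot \mathbf{E}_{\boldsymbol{x}}\Bigl[\sum_{i \notin J'_{\boldsymbol{x}}} \sum_{S \ni i} (\tfrac{2}{5\sqrt{3}})^{2|S|} f^{=S}(\boldsymbol{x})^2\Bigr] \\
&= 20^k \cdot \mathbf{E}_{\boldsymbol{x}}\Bigl[\sum_{i \notin J'_{\boldsymbol{x}}} \|T_{\frac{1}{\sqrt{3}}} \widetilde{g^i}{}_{|\boldsymbol{x}}\|_2^2\Bigr] && (\text{for } g^i = T_{\frac25} L_i f) \\
&\le 20^k \tau^{1/3} \cdot \mathbf{E}_{\boldsymbol{x}}\Bigl[\sum_{i \notin J'_{\boldsymbol{x}}} \|\widetilde{g^i}{}_{|\boldsymbol{x}}\|_{4/3}^{4/3}\Bigr] && (\text{Lemma 10.48}) \\
&\le 20^k \tau^{1/3} \cdot \sum_{i=1}^{n} \|L_i f\|_{4/3}^{4/3} && (\text{Theorem 10.35}) \\
&\le 20^k \tau^{1/3} \cdot \sum_{i=1}^{n} \mathbf{Inf}_i[f] && (\text{Exercise 8.10(b)}) \\
&= 20^k \tau^{1/3} \cdot \mathbf{I}[f] = (20 c^{-1/3})^k \cdot k\epsilon \le \epsilon/2,
\end{aligned}$$

the last inequality because $(20 c^{-1/3})^k k \le 1/2$ for all $k \ge 0$ once $c$ is a large
enough constant.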
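The last task in the proof is to establish (10.25). Using Cauchy–Schwarz,

$$\mathbf{E}_{\boldsymbol{x}}\Bigl[\mathbf{1}[J'_{\boldsymbol{x}} \text{ is too big}] \cdot \sum_{0 < |S| \le k} f^{=S}(\boldsymbol{x})^2\Bigr] \le \sqrt{\mathbf{E}_{\boldsymbol{x}}\bigl[\mathbf{1}[J'_{\boldsymbol{x}} \text{ is too big}]^2\bigr]} \cdot \sqrt{\mathbf{E}_{\boldsymbol{x}}\Bigl[\Bigl(\sum_{0 < |S| \le k} f^{=S}(\boldsymbol{x})^2\Bigr)^2\Bigr]}. \qquad (10.26)$$

For the first factor on the right of (10.26) we use Markov’s inequality:

$$\mathbf{E}_{\boldsymbol{x}}\bigl[\mathbf{1}[J'_{\boldsymbol{x}} \text{ is too big}]^2\bigr] = \Pr[J'_{\boldsymbol{x}} \text{ is too big}] = \Pr[|J'_{\boldsymbol{x}}| \ge C^k] \le C^{-k}\, \mathbf{E}[|J'_{\boldsymbol{x}}|] \le C^{-k}\, \mathbf{E}_{\boldsymbol{x}}\Bigl[\sum_{i=1}^{n} (1/\tau) \sum_{S \ni i} f^{=S}(\boldsymbol{x})^2\Bigr] = (c/C)^k \cdot \mathbf{I}[f]. \qquad (10.27)$$

As for the second factor on the right of (10.26), let’s write $h = T_{\frac25}(f - f^{=\emptyset})$.
(We are being slightly finicky about $f^{=\emptyset}$ just in case it’s very large.) Then

$$\begin{aligned}
\mathbf{E}_{\boldsymbol{x}}\Bigl[\Bigl(\sum_{0 < |S| \le k} f^{=S}(\boldsymbol{x})^2\Bigr)^2\Bigr]
&\le (5/2)^{4k} \cdot \mathbf{E}_{\boldsymbol{x}}\Bigl[\Bigl(\sum_{S \ne \emptyset} (\tfrac25)^{2|S|} f^{=S}(\boldsymbol{x})^2\Bigr)^2\Bigr] \\
&\le 40^k \cdot \mathbf{E}_{\boldsymbol{x}}\bigl[\|\widetilde{h}_{|\boldsymbol{x}}\|_2^4\bigr] \\
&\le 40^k \cdot \mathbf{E}_{\boldsymbol{x}}\bigl[\|\widetilde{h}_{|\boldsymbol{x}}\|_4^4\bigr] \\
&\le 40^k \cdot \|f - f^{=\emptyset}\|_4^4 && (\text{Theorem 10.35}) \\
&\le 40^k \cdot 2^2 \cdot \mathbf{E}[(f - f^{=\emptyset})^2] && (\text{since } |f - f^{=\emptyset}| \le 2 \text{ always}) \\
&= 4 \cdot 40^k \cdot \mathbf{Var}[f] \le 4 \cdot 40^k \cdot \mathbf{I}[f]. && (10.28)
\end{aligned}$$

Substituting (10.27) and (10.28) into (10.26) gives

$$\mathbf{E}_{\boldsymbol{x}}\Bigl[\mathbf{1}[J'_{\boldsymbol{x}} \text{ is too big}] \cdot \sum_{0 < |S| \le k} f^{=S}(\boldsymbol{x})^2\Bigr] \le \sqrt{(c/C)^k \cdot 4 \cdot 40^k} \cdot \mathbf{I}[f] = 2\bigl(\sqrt{40c/C}\bigr)^k \cdot k\epsilon \le \epsilon/2,$$

the last inequality again holding for all $k \ge 0$ once $C$ is chosen large enough
compared to $c$.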


We end this chapter by deducing Bourgain’s Sharp Threshold Theorem from 
Theorem 10.47. 


Proof of Bourgain’s Sharp Threshold Theorem. We take $\epsilon = .001$ in Theorem 10.47
and obtain the associated collections of subsets $\mathcal{F}_x$, where each
$|\mathcal{F}_x| \le \exp(O(K^2))$ and each $S \in \mathcal{F}_x$ satisfies $|S| \le O(K)$. Using the fact that
$f^{=\emptyset}(x)^2 = 1 - \mathbf{Var}[f] \le .99$ for each $x$ we get

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\sum_{S \in \mathcal{F}_{\boldsymbol{x}} \setminus \{\emptyset\}} f^{=S}(\boldsymbol{x})^2\Bigr] \ge 1 - 2\epsilon - .99 = .008.$$

We always have $|\mathcal{F}_x \setminus \{\emptyset\}| \le \exp(O(K^2))$, and there’s also no harm in assuming
$|\mathcal{F}_x \setminus \{\emptyset\}| > 0$. It follows that

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\Bigl[\max_{S \in \mathcal{F}_{\boldsymbol{x}} \setminus \{\emptyset\}} f^{=S}(\boldsymbol{x})^2\Bigr] \ge \frac{.008}{\exp(O(K^2))} = \exp(-O(K^2)).$$

Thus for each $x$ we can define a set $S_x$ with $0 < |S_x| \le O(K)$ such that

$$\mathbf{E}_{\boldsymbol{x} \sim \pi^{\otimes n}}\bigl[f^{=S_{\boldsymbol{x}}}(\boldsymbol{x})^2\bigr] \ge \exp(-O(K^2)). \qquad (10.29)$$

By Exercise 8.19 we have $|f^{=S_x}(x)| \le 2^{|S_x|} \le 2^{O(K)}$ and hence $f^{=S_x}(x)^2 \le
\exp(O(K))$ always. It follows from (10.29) that we must have

$$\Pr_{\boldsymbol{x} \sim \pi^{\otimes n}}\bigl[f^{=S_{\boldsymbol{x}}}(\boldsymbol{x})^2 \ge \exp(-O(K^2))\bigr] \ge \exp(-O(K^2)).$$

We will complete the proof by showing that whenever $f^{=S_x}(x)^2 \ge
\exp(-O(K^2))$ occurs, there exists $T \subseteq S_x$ such that $x_T$ is a $\pm\exp(-O(K^2))$-booster
for $f$. Thus we either have a $+\exp(-O(K^2))$-booster with probability
at least $\frac12 \exp(-O(K^2))$, or a $-\exp(-O(K^2))$-booster with probability at least
$\frac12 \exp(-O(K^2))$; either way, the proof will be complete.
Assume then that $f^{=S_x}(x)^2 \ge \exp(-O(K^2))$; equivalently,

$$|f^{=S_x}(x)| \ge \exp(-O(K^2)).$$

Let’s now work with $g = f - \mathbf{E}[f]$. Of course $g^{=T} = f^{=T}$ for all $T \ne \emptyset$; since
$S_x \ne \emptyset$ the above inequality tells us that $|g^{=S_x}(x)| \ge \exp(-O(K^2))$. Recall the
formula

$$g^{=S_x}(x) = \sum_{\emptyset \ne T \subseteq S_x} (-1)^{|S_x| - |T|}\, g^{\subseteq T}(x);$$

we dropped the $T = \emptyset$ term since it’s 0. As there are only $2^{|S_x|} - 1 =
\exp(O(K))$ terms in the above sum, we deduce there must exist some $T \subseteq S_x$
with $0 < |T| \le O(K)$ such that

$$|g^{\subseteq T}(x)| \ge \exp(-O(K^2))/\exp(O(K)) = \exp(-O(K^2)).$$

But $g^{\subseteq T} = f^{\subseteq T} - \mathbf{E}[f]$, so the above gives us $|f^{\subseteq T}(x) - \mathbf{E}[f]| \ge
\exp(-O(K^2))$. This precisely says that $x_T$ is a $\pm\exp(-O(K^2))$-booster, as
desired.

For a relaxation of the assumption $\mathbf{Var}[f] \ge .01$ in this theorem, see
Exercise 10.41.


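10.6. Exercises and Notes


10.1 Let $\boldsymbol{X}$ be a random variable and let $1 \le r \le \infty$. Recall that the triangle
(Minkowski) inequality implies that for real-valued functions $f_1$, $f_2$,

$$\|f_1(\boldsymbol{X}) + f_2(\boldsymbol{X})\|_r \le \|f_1(\boldsymbol{X})\|_r + \|f_2(\boldsymbol{X})\|_r.$$

More generally, if $w_1, \dots, w_m$ are nonnegative reals and $f_1, \dots, f_m$ are
real functions, then

$$\|w_1 f_1(\boldsymbol{X}) + \cdots + w_m f_m(\boldsymbol{X})\|_r \le w_1 \|f_1(\boldsymbol{X})\|_r + \cdots + w_m \|f_m(\boldsymbol{X})\|_r.$$

Still more generally, if $\boldsymbol{Y}$ is a random variable independent of $\boldsymbol{X}$ and
$f(\boldsymbol{X}, \boldsymbol{Y})$ is a (measurable) real-valued function, then it holds that

$$\bigl\|\mathbf{E}_{\boldsymbol{Y}}[f(\boldsymbol{X}, \boldsymbol{Y})]\bigr\|_{r,\boldsymbol{X}} \le \mathbf{E}_{\boldsymbol{Y}}\bigl[\|f(\boldsymbol{X}, \boldsymbol{Y})\|_{r,\boldsymbol{X}}\bigr].$$

Using this last fact, show that whenever $0 < p \le q \le \infty$,

$$\Bigl\|\, \|f(\boldsymbol{X}, \boldsymbol{Y})\|_{p,\boldsymbol{Y}} \Bigr\|_{q,\boldsymbol{X}} \le \Bigl\|\, \|f(\boldsymbol{X}, \boldsymbol{Y})\|_{q,\boldsymbol{X}} \Bigr\|_{p,\boldsymbol{Y}}.$$

(Hint: Raise the inequality to the power of $p$ and use $r = q/p$.)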
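Before attempting the proof, it can help to see the mixed-norm inequality of this exercise on a concrete finite example. The sketch below is purely illustrative: the two finite distributions, the random table of values, and the function names `norm`, `lhs`, `rhs` are assumptions made for the example.

```python
import random

def norm(values, weights, q):
    """(sum_i weights[i] * |values[i]|^q)^(1/q) for a finite distribution."""
    return sum(w * abs(v) ** q for v, w in zip(values, weights)) ** (1 / q)

def lhs(table, px, py, p, q):
    """|| ||f(X,Y)||_{p,Y} ||_{q,X}, where table[i][j] = f(x_i, y_j)."""
    return norm([norm(row, py, p) for row in table], px, q)

def rhs(table, px, py, p, q):
    """|| ||f(X,Y)||_{q,X} ||_{p,Y}."""
    columns = list(zip(*table))
    return norm([norm(col, px, q) for col in columns], py, p)

random.seed(2)
px, py = [0.5, 0.3, 0.2], [0.6, 0.4]          # distributions of X and Y
table = [[random.uniform(-2, 2) for _ in py] for _ in px]
for p, q in [(0.5, 1.0), (1.0, 2.0), (1.5, 3.0), (2.0, 2.0)]:
    print(p, q, lhs(table, px, py, p, q) <= rhs(table, px, py, p, q) + 1e-12)
```

When $p = q$ the two sides coincide, which is why a small tolerance is used in the comparison.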


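10.2 The goal of this exercise is to prove Proposition 9.15: If $\boldsymbol{X}$ and $\boldsymbol{Y}$
are independent $(p, q, \rho)$-hypercontractive random variables, then so is
$\boldsymbol{X} + \boldsymbol{Y}$. Let $a, b \in \mathbb{R}$.

(a) First obtain

$$\|a + \rho b(\boldsymbol{X} + \boldsymbol{Y})\|_{q,\boldsymbol{X},\boldsymbol{Y}} \le \Bigl\|\, \|a + \rho b\boldsymbol{X} + b\boldsymbol{Y}\|_{p,\boldsymbol{Y}} \Bigr\|_{q,\boldsymbol{X}}.$$

(b) Next, upper-bound this by

$$\Bigl\|\, \|a + b\boldsymbol{Y} + \rho b\boldsymbol{X}\|_{q,\boldsymbol{X}} \Bigr\|_{p,\boldsymbol{Y}}.$$

(Hint: Exercise 10.1.)

(c) Finally, upper-bound this by

$$\Bigl\|\, \|a + b\boldsymbol{Y} + b\boldsymbol{X}\|_{p,\boldsymbol{X}} \Bigr\|_{p,\boldsymbol{Y}} = \|a + b(\boldsymbol{X} + \boldsymbol{Y})\|_{p,\boldsymbol{X},\boldsymbol{Y}}.$$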


10.3 Let $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n$ be independent $(p, q, \rho)$-hypercontractive random
variables. Let $F(x) = \sum_{S \subseteq [n]} \widehat{F}(S)\, x^S$ be an $n$-variate multilinear
polynomial. Define formally the multilinear polynomial $T_\rho F(x) =
\sum_{S \subseteq [n]} \rho^{|S|} \widehat{F}(S)\, x^S$. The goal of this exercise is to show

$$\|T_\rho F(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n)\|_q \le \|F(\boldsymbol{X}_1, \dots, \boldsymbol{X}_n)\|_p. \qquad (10.30)$$

Note that this result yields an alternative deduction of the Hypercontractivity
Theorem for $\pm 1$ bits from the Two-Point Inequality. A (notationally
intense) generalization of this exercise can also be used as an
alternative inductive strategy for deducing the General Hypercontractivity
Theorem from Proposition 10.17 or Theorem 10.18.

(a) Why is Exercise 10.2 a special case of (10.30)?

(b) Begin the inductive proof of (10.30) by showing that the base case
$n = 0$ is trivial.

(c) For the case of general $n$, first establish

$$\|T_\rho F(\boldsymbol{X})\|_q \le \Bigl\|\, \|T'_\rho E(\boldsymbol{X}') + \boldsymbol{X}_n T'_\rho D(\boldsymbol{X}')\|_{p,\boldsymbol{X}_n} \Bigr\|_{q,\boldsymbol{X}'},$$

where we are using the notation $x' = (x_1, \dots, x_{n-1})$, $F(x) =
E(x') + x_n D(x')$, and $T'_\rho$ for the operator acting formally on $(n-1)$-variate
multilinear polynomials.

(d) Complete the inductive step, using steps similar to Exercises 10.2(b),(c).
(Hint: For $X_n$ a real constant, why is $T'_\rho E(\boldsymbol{X}') +
X_n T'_\rho D(\boldsymbol{X}') = T'_\rho(E + X_n D)(\boldsymbol{X}')$?)

10.4 This exercise is concerned with the possibility of a converse for
Proposition 10.8.

(a) In our proof of the Two-Point Inequality we used Proposition 9.19 to
deduce that a uniform bit $\boldsymbol{x} \sim \{-1,1\}$ is $(p, q, \rho)$-hypercontractive
if it’s $(q', p', \rho)$-hypercontractive. Why can’t we use Proposition 9.19
to deduce this for a general random variable $\boldsymbol{X}$?

(b) For each $1 < p < 2$, exhibit a random variable $\boldsymbol{X}$ that is $(p, 2, \rho)$-hypercontractive
(for some $\rho$) but not $(2, p', \rho)$-hypercontractive.

10.5 (a) Regarding Remark 10.2, heuristically justify (in the manner of
Exercise 9.24(a)) the following statement: If $A, B \subseteq \{-1,1\}^n$ are concentric
Hamming balls with volumes $\exp(-\frac{a^2}{2})$ and $\exp(-\frac{b^2}{2})$ and
$\rho a \le b \le a$ (where $0 < \rho < 1$), then

$$\Pr_{(\boldsymbol{x},\boldsymbol{y})\ \rho\text{-correlated}}[\boldsymbol{x} \in A, \boldsymbol{y} \in B] \gtrapprox \exp\Bigl(-\frac12\, \frac{a^2 - 2\rho ab + b^2}{1 - \rho^2}\Bigr);$$

and further, if $b < \rho a$, then $\Pr[\boldsymbol{x} \in A, \boldsymbol{y} \in B] \sim \Pr[\boldsymbol{x} \in A]$. Here
you should treat $\rho$ as fixed and $a, b \to \infty$.

(b) Similarly, heuristically justify that the Reverse Small-Set Expansion
Theorem is essentially sharp by considering diametrically opposed
Hamming balls.

10.6 The goal of this exercise (and Exercise 10.7) is to prove the Reverse
Hypercontractivity Theorem and its equivalent Two-Function version:

Reverse Hypercontractivity Theorem. Let $f : \{-1,1\}^n \to \mathbb{R}^{\ge 0}$ be a
nonnegative function and let $-\infty \le q < p \le 1$. Then $\|T_\rho f\|_q \ge \|f\|_p$
for $0 \le \rho \le \sqrt{(1-p)/(1-q)}$.

Reverse Two-Function Hypercontractivity Theorem. Let $f, g :
\{-1,1\}^n \to \mathbb{R}^{\ge 0}$ be nonnegative, let $r, s \le 0$, and assume $0 \le \rho \le
\sqrt{rs} \le 1$. Then

$$\mathbf{E}_{(\boldsymbol{x},\boldsymbol{y})\ \rho\text{-correlated}}[f(\boldsymbol{x})\, g(\boldsymbol{y})] \ge \|f\|_{1+r}\, \|g\|_{1+s}.$$

Recall that for $-\infty < p < 0$ and for positive functions $f \in L^2(\Omega, \pi)$
the “norm” $\|f\|_p$ retains the definition $\mathbf{E}[f^p]^{1/p}$. (The cases of $p =
-\infty$, $p = 0$, and nonnegative functions are defined by appropriate limits;
in particular $\|f\|_{-\infty}$ is the minimum of $f$’s values, $\|f\|_0$ is the geometric
mean of $f$’s values, and $\|f\|_p$ is 0 whenever $f$ is not everywhere positive.
We also define $p'$ by $\frac1p + \frac1{p'} = 1$, with $0' = 0$.)

The Reverse Two-Function Hypercontractivity Theorem can be
thought of as a generalization of the lesser known “reverse Hölder
inequality” in the setting of $L^2(\{-1,1\}^n, \pi_{1/2}^{\otimes n})$:

Reverse Hölder inequality. Let $f \in L^2(\Omega, \pi)$ be a positive function.
Then for any $p < 1$,

$$\|f\|_p = \inf\{\mathbf{E}[fg] : g > 0,\ \|g\|_{p'} = 1\}.$$

In particular, for $r < 0$ and $f, g > 0$ we have $\mathbf{E}[fg] \ge \|f\|_{1+r}\, \|g\|_{1+1/r}$.

(a) Show that to prove these two Reverse Hypercontractivity Theorems
it suffices to consider the case of $f, g : \{-1,1\}^n \to \mathbb{R}^+$, i.e., strictly
positive functions.

(b) Show that the Reverse Two-Function Hypercontractivity Theorem is
equivalent (via the reverse Hölder inequality) to the Reverse Hypercontractivity
Theorem.

(c) Reduce the Reverse Two-Function Hypercontractivity Theorem to
the $n = 1$ case. (Hint: Virtually identical to the Two-Function Hypercontractivity
Induction.) Further reduce to the following:

Reverse Two-Point Inequality. Let $-\infty \le q < p \le 1$ and let
$0 \le \rho \le \sqrt{(1-p)/(1-q)}$. Then $\|T_\rho f\|_q \ge \|f\|_p$ for any $f :
\{-1,1\} \to \mathbb{R}^+$.
10.7 The goal of this exercise is to prove the Reverse Two-Point Inequality.

(a) Similar to the non-reverse case, the main effort is proving
the inequality assuming that $0 < q < p < 1$ and that $\rho =
\sqrt{(1-p)/(1-q)}$. Do this by mimicking the proof of the Two-Point
Inequality. (Hint: You will need the inequality $(1+t)^\theta \ge 1 + \theta t$ for
$\theta \ge 1$, and you will need to show that $\frac{j-r}{\sqrt{1-r}}$ is an increasing function
of $r$ on $[0, 1)$ for all $j \ge 2$.)

(b) Extend to the case of $0 \le \rho < \sqrt{(1-p)/(1-q)}$. (Hint: Use the fact
that for any $f : \{-1,1\}^n \to \mathbb{R}^{\ge 0}$ and $-\infty \le p \le q \le \infty$ we have
$\|f\|_p \le \|f\|_q$. You can prove this generalization of Exercise 1.13
by reducing the case of negative $p$ and $q$ to the case of positive $p$
and $q$.)

(c) Establish the $q = -\infty$ case of the Reverse Two-Point Inequality.

(d) Show that the cases $-\infty < q < p < 0$ follow by “duality”. (Hint:
Like Proposition 9.19 but with the reverse Hölder inequality.)

(e) Show that the cases $q < 0 < p$ follow by the semigroup property
of $T_\rho$.

(f) Finally, treat the cases of $p = 0$ or $q = 0$.


10.8 Give a simple proof of the $n = 1$ case of the Reverse Two-Function
Hypercontractivity Theorem when $r = s = -1/2$. (Hint: Replace $f$ and
$g$ by $f^2$ and $g^2$; then you don’t even need to assume $f$ and $g$ are
nonnegative.) Can you also give a simple proof when $r = s = -1 + 1/k$
for integers $k > 2$?

10.9 By selecting “r” = —p ae and “s” = —p He, prove the Reverse 
Small-Set Expansion Theorem mentioned in Remark 10.3. (Hint: The 
negative norm of a 0-1-indicator is 0, so be sure to verify no negative 
norms arise.) 

10.10 Let $g \in L^2(\Omega^n, \pi^{\otimes n})$. Writing $x = (x_1, x')$, where $x' = (x_2, \dots, x_n)$,
carefully justify the following identity of one-input functions: $(T^1_\rho g)_{|x'} =
T_\rho(g_{|x'})$. (Hint: You may want to refer to Exercise 8.21.)

10.11 Prove Proposition 10.12. 

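10.12 Let $\boldsymbol{X}$ be a random variable and let $\boldsymbol{Y}$ denote its symmetrization $\boldsymbol{X} - \boldsymbol{X}'$,
where $\boldsymbol{X}'$ is an independent copy of $\boldsymbol{X}$. Show for any $t, \theta \in \mathbb{R}$ that
$\Pr[|\boldsymbol{Y}| \ge t] \le 2\Pr[|\boldsymbol{X} - \theta| \ge t/2]$.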
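The tail bound in this exercise can be checked exactly on a small discrete distribution. The following sketch is only an illustration: the particular three-point distribution and the helper names `tail` and `symmetrization` are assumptions made for the example, and exact rational arithmetic avoids any rounding issues in the comparison.

```python
from fractions import Fraction
import itertools

def tail(dist, predicate):
    """Probability that an outcome of `dist` (value -> probability) satisfies predicate."""
    return sum(p for v, p in dist.items() if predicate(v))

def symmetrization(dist):
    """Distribution of Y = X - X' for X' an independent copy of X."""
    out = {}
    for (a, pa), (b, pb) in itertools.product(dist.items(), repeat=2):
        out[a - b] = out.get(a - b, 0) + pa * pb
    return out

X = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}
Y = symmetrization(X)

for t in range(0, 7):
    for theta in (0, 1, 2, Fraction(5, 2)):
        left = tail(Y, lambda y: abs(y) >= t)
        right = 2 * tail(X, lambda x: abs(x - theta) >= Fraction(t, 2))
        assert left <= right
print("bound holds on all tested (t, theta) pairs")
```

The check mirrors the intended argument: if $|\boldsymbol{X} - \boldsymbol{X}'| \ge t$ then at least one of $\boldsymbol{X}$, $\boldsymbol{X}'$ is at distance at least $t/2$ from $\theta$.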




10.13 The goal of this exercise is to establish Lemma 10.43.

(a) Show that we may take $c_2 = 1$ (and that equality holds). Henceforth
assume $q > 2$.

(b) By following the idea of our $q = 4$ proof, reduce to showing that
there exists $0 < c_q < 1$ such that

$$|1 - c_q x|^q + c_q q x \le |1 + x|^q - qx \quad \forall x \in \mathbb{R}.$$

(c) Further reduce to showing there exists $0 < c_q < 1$ such that

$$\frac{|1 - c_q x|^q + c_q q x - 1}{x^2} \le \frac{|1 + x|^q - qx - 1}{x^2} \quad \forall x \in \mathbb{R}. \qquad (10.31)$$

Here you should also establish that both sides are continuous functions
of $x \in \mathbb{R}$ once the value at $x = 0$ is defined appropriately.

(d) Show that there exists $M > 0$ such that for every $0 < c_q < \frac12$,
inequality (10.31) holds once $|x| \ge M$. (Hint: Consider the limit
of both sides as $|x| \to \infty$.)

(e) Argue that it suffices to show that

$$\frac{|1 + x|^q - qx - 1}{x^2} \ge \eta \qquad (10.32)$$

for some universal positive constant $\eta > 0$. (Hint: A uniform continuity
argument for $(x, c_q) \in [-M, M] \times [0, \frac12)$.)

(f) Establish (10.32). (Hint: The best possible $\eta$ is 1, but to just achieve
some positive $\eta$, argue using Bernoulli’s inequality that $\frac{|1+x|^q - qx - 1}{x^2}$ is
everywhere positive and then observe that it tends to $\infty$ as $|x| \to \infty$.)

(g) Possibly using a different argument, what is the best asymptotic 


bound you can achieve for cq? Is cg = Q( 284) possible? 
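10.14 Show that the largest $c$ for which inequality (10.20) holds is the smaller
real root of $c^4 - 2c^3 - 2c + 1 = 0$, namely, $c \approx .435$.

10.15 (a) Show that $1 + 6c^2x^2 + c^4x^4 \le 1 + 6x^2 + 4x^3 + x^4$ holds for all
$x \in \mathbb{R}$ when $c = 1/2$. (Can you also establish it for $c \approx .5269$?)

(b) Show that if $\boldsymbol{X}$ is a random variable satisfying $\mathbf{E}[\boldsymbol{X}] = 0$ and $\|\boldsymbol{X}\|_4 <
\infty$, then $\|a + \frac12 \boldsymbol{r}\boldsymbol{X}\|_4 \le \|a + \boldsymbol{X}\|_4$ for all $a \in \mathbb{R}$, where $\boldsymbol{r} \sim \{-1,1\}$
is a uniformly random bit independent of $\boldsymbol{X}$. (Cf. Lemma 10.15.)

(c) Establish the following improvement of Theorem 10.44 in the case
of $q = 4$: for all $f \in L^2(\Omega^n, \pi^{\otimes n})$,

$$\|T_{\frac12 \boldsymbol{r}} f(\boldsymbol{x})\|_{4,\boldsymbol{r},\boldsymbol{x}} \le \|f(\boldsymbol{x})\|_{4,\boldsymbol{x}}$$

(where $\boldsymbol{x} \sim \pi^{\otimes n}$, $\boldsymbol{r} \sim \{-1,1\}^n$).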




10.16 Complete the proof of Theorem 10.39. (Hint: You’ll need to rework 
Exercise 9.8 as in Lemma 10.38.) 


10.17 Prove Proposition 10.17. 


10.18 Recall from (10.5) the function $\rho = \rho(\lambda)$ defined for $\lambda \in (0, 1/2)$ (and
fixed $q > 2$) by

$$\rho = \rho(\lambda) = \sqrt{\frac{\exp(u/q) - \exp(-u/q)}{\exp(u/q') - \exp(-u/q')}} = \sqrt{\frac{\sinh(u/q)}{\sinh(u/q')}},$$

where $u = u(\lambda)$ is defined by $\exp(-u) = \frac{\lambda}{1-\lambda}$.

(a) Show that $\rho$ is an increasing function of $\lambda$. (Hint: One route is to
reduce to showing that $\rho^2$ is a decreasing function of $u \in (0, \infty)$,
reduce to showing that $q \tanh(u/q)$ is an increasing function of
$q \in (1, \infty)$, reduce to showing $\frac{\tanh r}{r}$ is a decreasing function of
$r \in (0, \infty)$, and reduce to showing $\sinh(2r) \ge 2r$.)

(b) Verify the following statements from Remark 10.19:

for fixed $q$ and $\lambda \to 1/2$, $\rho \to \frac{1}{\sqrt{q-1}}$;

for fixed $q$ and $\lambda \to 0$, $\rho \sim \lambda^{1/2 - 1/q}$.

Also show:

for fixed $\lambda$ and $q \to \infty$, $\rho \sim \sqrt{\frac{u}{\sinh u}} \sqrt{\frac{1}{q}}$,

and $\frac{u}{\sinh u} \sim 2\lambda \ln(1/\lambda)$ for $\lambda \to 0$.

(c) Show that $\rho \ge \frac{1}{\sqrt{q-1}} \cdot \lambda^{1/2 - 1/q}$ holds for all $\lambda$.
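A quick numerical look at $\rho(\lambda)$ can guide the analysis in this exercise. The sketch below is illustrative only: it assumes the reading $\exp(-u) = \lambda/(1-\lambda)$ of the definition, fixes $q = 4$ as an example, and uses a hypothetical function name `rho`. It tabulates $\rho$ on a grid and compares it with the two limiting behaviors from part (b).

```python
import math

def rho(lam, q):
    """rho(lambda) from (10.5), with u defined by exp(-u) = lambda / (1 - lambda)."""
    q_conj = q / (q - 1)                    # conjugate Hoelder index q'
    u = math.log((1 - lam) / lam)           # u > 0 for lambda in (0, 1/2)
    return math.sqrt(math.sinh(u / q) / math.sinh(u / q_conj))

q = 4
grid = [i / 100 for i in range(1, 50)]
values = [rho(lam, q) for lam in grid]
print(all(a < b for a, b in zip(values, values[1:])))    # increasing on the grid?
print(rho(0.499999, q), 1 / math.sqrt(q - 1))            # behavior as lambda -> 1/2
print(rho(1e-12, q) / (1e-12) ** (0.5 - 1 / q))          # ratio as lambda -> 0
```

Evidence on a grid is no substitute for the monotonicity proof requested in part (a), but it does help catch a misread formula.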

10.19 Let $(\Omega, \pi)$ be a finite probability space, $|\Omega| \ge 2$, in which every outcome
has probability at least $\lambda$. Let $1 < p < 2$ and $0 < \rho < 1$. The goal of
this exercise is to prove the result of Wolff (Wolff, 2007) that, subject
to $\|T_\rho f\|_2 = 1$, every $f \in L^2(\Omega, \pi)$ that minimizes $\|f\|_p$ takes on at
most two values (and there is at least one minimizing $f$).

(a) We consider the equivalent problem of minimizing $F(f) = \|f\|_p^p$
subject to $G(f) = \|T_\rho f\|_2^2 = 1$. Show that both $F(f)$ and $G(f)$ are
$\mathcal{C}^1$ functionals (identifying functions $f$ with points in $\mathbb{R}^\Omega$).

(b) Argue from continuity that the minimum value for $\|f\|_p^p$ subject to
$\|T_\rho f\|_2^2 = 1$ is attained. Henceforth write $f_0$ to denote any minimizer;
the goal is to show that $f_0$ takes on at most two values.

(c) Show that $f_0$ is either everywhere nonnegative or everywhere nonpositive.
(Hint: By homogeneity our problem is equivalent to maximizing
$\|T_\rho f\|_2$ subject to $\|f\|_p = 1$; now use Exercise 2.34.)
Replacing $f_0$ by $|f_0|$ if necessary, henceforth assume $f_0$ is nonnegative.

(d) Show that $\nabla F(f_0) = \pi \cdot p f_0^{p-1}$ and $\nabla G(f_0) = \pi \cdot 2T_{\rho^2} f_0$. Here
$\pi \cdot g$ signifies the pointwise product of functions on $\Omega$, with $\pi$
thought of as a function $\Omega \to \mathbb{R}^{\ge 0}$. (Hint: For the latter, write
$G(f) = \langle T_{\rho^2} f, f \rangle$.)

(e) Use the method of Lagrange Multipliers to show that $c f_0^{p-1} = T_{\rho^2} f_0$
for some $c \in \mathbb{R}^+$. (Hint: You’ll need to note that $\nabla G(f_0) \ne 0$.)

(f) Writing $\mu = \mathbf{E}[f_0]$, argue that each value $y = f_0(\omega)$ satisfies the
equation

$$c y^{p-1} = \rho^2 y + (1 - \rho^2)\mu. \qquad (10.33)$$

(g) Show that (10.33) has at most two solutions for $y \in \mathbb{R}^+$, thereby
completing the proof that $f_0$ takes on at most two values. (Hint:
Strict concavity of $y^{p-1}$.)

(h) Suppose $q > 2$. By slightly modifying the above argument, show
that subject to $\|g\|_2 = 1$, every $g \in L^2(\Omega, \pi)$ that maximizes
$\|T_\rho g\|_q$ takes on at most two values (and there is at least one
maximizing $g$). (Hint: At some point you might want to make the
substitution $g = T_\rho f$; note that $g$ is two-valued if $f$ is.)

10.20 Fix $1 < p < 2$ and $0 < \lambda < 1/2$. Let $\Omega = \{-1, 1\}$ and $\pi = \pi_\lambda$, meaning
$\pi(-1) = \lambda$, $\pi(1) = 1 - \lambda$. The goal of this exercise is to show the result
of Latała and Oleszkiewicz (Latała and Oleszkiewicz, 1994): the largest
value of $\rho$ for which $\|T_\rho f\|_2 \le \|f\|_p$ holds for all $f \in L^2(\Omega, \pi)$ is as
given in Theorem 10.18; i.e., it satisfies

$$\rho^2 = r^{*2} = \frac{\exp(u/p') - \exp(-u/p')}{\exp(u/p) - \exp(-u/p)}, \qquad (10.34)$$

where $u$ is defined by $\exp(-u) = \frac{\lambda}{1-\lambda}$. (Here we are using $p = q'$ to
facilitate the proof; we get the $(2, q)$-hypercontractivity statement by
Proposition 9.19.)

(a) Let’s introduce the notation $\alpha = \lambda^{1/p}$, $\beta = (1 - \lambda)^{1/p}$. Show that

$$r^{*2} = \frac{\alpha^p \beta^{2-p} - \alpha^{2-p} \beta^p}{\alpha^2 - \beta^2}.$$

(b) Let $f \in L^2(\Omega, \pi)$. Write $\mu = \mathbf{E}[f]$ and $\delta = \mathrm{D}_1 f = \widehat{f}(1)$. Our goal
will be to show

$$\mu^2 + \delta^2 r^{*2} = \|T_{r^*} f\|_2^2 \le \|f\|_p^2. \qquad (10.35)$$

In the course of doing this, we’ll also exhibit a nonconstant function
$f$ that makes the above inequality sharp. Why does this establish
that no larger value of $\rho$ is possible?

(c) Show that without loss of generality we may assume

$$f(-1) = \frac{1 + y}{\alpha}, \qquad f(1) = \frac{1 - y}{\beta}$$

for some $-1 < y < 1$. (Hint: First use Exercise 2.34 and a continuity
argument to show that we may assume $f > 0$; then use homogeneity
of (10.35).)

(d) The left-hand side of (10.35) is now a quadratic function of $y$. Show
that our $r^*$ is precisely such that

$$\mathrm{LHS}(10.35) = Ay^2 + C$$

for some constants $A$, $C$; i.e., $r^*$ makes the linear term in $y$
drop out. (Hint: Work exclusively with the $\alpha$, $\beta$ notation and
recall from Definition 8.44 that $\delta^2 = \lambda(1 - \lambda)(f(1) - f(-1))^2 =
\alpha^p \beta^p (f(1) - f(-1))^2$.)

(e) Compute that

$$A = 2\,\frac{\beta^{p-1} - \alpha^{p-1}}{\beta - \alpha}. \qquad (10.36)$$

(Hint: You’ll want to multiply the above expression by $\alpha^p + \beta^p = 1$.)

(f) Show that

$$\mathrm{RHS}(10.35) = \left((1 + y)^p + (1 - y)^p\right)^{2/p}.$$

Why does it now suffice to show (10.35) just for $0 \le y < 1$?

(g) Let $y^* = \frac{\beta - \alpha}{\beta + \alpha} > 0$. Show that if $y = -y^*$, then $f$ is a constant
function and both sides of (10.35) are equal to $\frac{4}{(\alpha + \beta)^2}$.

(h) Deduce that both sides of (10.35) are equal to $\frac{4}{(\alpha + \beta)^2}$ for $y = y^*$.
Verify that after scaling, this yields the following nonconstant function
for which (10.35) is sharp: $f(x) = \exp(-xu/p)$.

(i) Write $y = \sqrt{z}$ for $0 \le z < 1$. By now we have reduced to showing

$$Az + C \le \left((1 + \sqrt{z})^p + (1 - \sqrt{z})^p\right)^{2/p},$$

knowing that both sides are equal when $\sqrt{z} = y^*$. Calling the
expression on the right $\phi(z)$, show that

$$\frac{d}{dz}\phi(z)\Big|_{\sqrt{z} = y^*} = A.$$

(Hint: You’ll need $\alpha^p + \beta^p = 1$, as well as the fact from part (h)
that $\phi(z) = \frac{4}{(\alpha + \beta)^2}$ when $\sqrt{z} = y^*$.) Deduce that we can complete
the proof by showing that $\phi(z)$ is convex for $z \in [0, 1)$.

(j) Show that $\phi$ is indeed convex on $[0, 1)$ by showing that its
derivative is a nondecreasing function of $z$. (Hint: Use the Generalized
Binomial Theorem as well as $1 < p < 2$ to show that
$(1 + \sqrt{z})^p + (1 - \sqrt{z})^p$ is expressible as $\sum_{j=0}^{\infty} b_j z^j$ where each $b_j$
is positive.)
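The algebra in Exercise 10.20 is easy to get wrong by a sign, so a numerical spot check is useful. The sketch below assumes the setup of the exercise ($\pi(-1) = \lambda$, $u = \ln\frac{1-\lambda}{\lambda}$, $p' = p/(p-1)$, and $\|T_r f\|_2^2 = \mathbf{E}[f]^2 + r^2\,\mathbf{Var}[f]$ on a one-coordinate space). All function names are illustrative. It compares the two expressions for $r^{*2}$ and evaluates both sides of (10.35) on the candidate extremal function $f(x) = \exp(-xu/p)$.

```python
import math


def r_star_squared(lam: float, p: float) -> float:
    """(10.34): sinh(u/p') / sinh(u/p), with u = ln((1-lam)/lam), p' = p/(p-1)."""
    u = math.log((1 - lam) / lam)
    return math.sinh(u * (p - 1) / p) / math.sinh(u / p)


def r_star_squared_alpha_beta(lam: float, p: float) -> float:
    """Part (a): the same quantity written with alpha = lam^(1/p), beta = (1-lam)^(1/p)."""
    a, b = lam ** (1 / p), (1 - lam) ** (1 / p)
    return (a**p * b ** (2 - p) - a ** (2 - p) * b**p) / (a**2 - b**2)


def both_sides(lam: float, p: float, f_minus: float, f_plus: float):
    """Return (||T_{r*} f||_2^2, ||f||_p^2) for f(-1) = f_minus, f(1) = f_plus."""
    mean = lam * f_minus + (1 - lam) * f_plus
    var = lam * (1 - lam) * (f_plus - f_minus) ** 2
    lhs = mean**2 + r_star_squared(lam, p) * var
    rhs = (lam * abs(f_minus) ** p + (1 - lam) * abs(f_plus) ** p) ** (2 / p)
    return lhs, rhs


lam, p = 0.2, 1.5
u = math.log((1 - lam) / lam)
print(r_star_squared(lam, p), r_star_squared_alpha_beta(lam, p))
# Candidate extremal function f(x) = exp(-x u / p): both sides should agree.
print(both_sides(lam, p, math.exp(u / p), math.exp(-u / p)))
# A generic function: the left side should not exceed the right side.
print(both_sides(lam, p, 1.0, 0.0))
```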
10.21 Complete the proof of Theorem 10.18. (Hint: Besides Exercises 10.19 
and 10.20, you’ll also need Exercise 10.18(a).) 
10.22 (a) Let $\Phi : [0, \infty) \to \mathbb{R}$ be defined by $\Phi(x) = x \ln x$, where we take
$0 \ln 0 = 0$. Verify that $\Phi$ is a smooth, strictly convex function.
(b) Consider the following:
Definition 10.49. Let $g \in L^2(\Omega, \pi)$ be a nonnegative function. The
entropy of $g$ is defined by

$$\mathbf{Ent}[g] = \mathop{\mathbf{E}}_{x \sim \pi}[\Phi(g(x))] - \Phi\Big(\mathop{\mathbf{E}}_{x \sim \pi}[g(x)]\Big).$$

Verify that $\mathbf{Ent}[g] \ge 0$ always, that $\mathbf{Ent}[g] = 0$ if and only if $g$ is
constant, and that $\mathbf{Ent}[cg] = c\,\mathbf{Ent}[g]$ for any constant $c \ge 0$.

(c) Suppose $\varphi$ is a probability density on $\{-1, 1\}^n$ (recall
Definition 1.20). Show that $\mathbf{Ent}[\varphi] = D_{\mathrm{KL}}(\varphi \,\|\, \pi_{1/2}^{\otimes n})$, the Kullback-Leibler
divergence of the uniform distribution from $\varphi$ (more
precisely, the distribution with density $\varphi$).
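Definition 10.49 translates directly into code for a finite probability space. The sketch below represents a function by the list of its values and the measure by a list of weights; the names `ent` and `kl_from_reference` are illustrative. It spot-checks the three properties from part (b) and the identity from part (c) on one small example, under the assumption that the density is taken with respect to the same reference measure.

```python
import math


def phi(t: float) -> float:
    """Phi(t) = t ln t, with the convention 0 ln 0 = 0."""
    return 0.0 if t == 0 else t * math.log(t)


def ent(g, pi) -> float:
    """Ent[g] = E[Phi(g)] - Phi(E[g]) for nonnegative g on a finite space.
    `g` lists the values of g, `pi` the matching outcome probabilities."""
    mean = sum(w * v for w, v in zip(pi, g))
    return sum(w * phi(v) for w, v in zip(pi, g)) - phi(mean)


def kl_from_reference(density, pi) -> float:
    """D_KL(mu || pi), where mu has the given density relative to pi,
    i.e. mu assigns mass pi_i * density_i to outcome i."""
    return sum(w * d * math.log(d) for w, d in zip(pi, density) if d > 0)


pi = [0.25, 0.25, 0.25, 0.25]
g = [0.0, 1.0, 2.0, 5.0]
density = [0.4, 0.8, 1.2, 1.6]  # averages to 1 under pi

print(ent(g, pi))                                # nonnegative
print(ent([3.0] * 4, pi))                        # zero for a constant
print(ent([7 * v for v in g], pi), 7 * ent(g, pi))
print(ent(density, pi), kl_from_reference(density, pi))
```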


10.23 The goal of this exercise is to establish:


The Log-Sobolev Inequality. Let $f : \{-1, 1\}^n \to \mathbb{R}$. Then
$\tfrac12 \mathbf{Ent}[f^2] \le \mathbf{I}[f]$.


(a) Writing $\rho = e^{-t}$, the $(p, 2)$-Hypercontractivity Theorem tells us that

$$\|T_{e^{-t}} f\|_2^2 \le \|f\|_{1 + \exp(-2t)}^2$$

for all $t \ge 0$. Denote the left- and right-hand sides as $\mathrm{LHS}(t)$,
$\mathrm{RHS}(t)$. Verify that these are smooth functions of $t \in [0, \infty)$ and
that $\mathrm{LHS}(0) = \mathrm{RHS}(0)$. Deduce that $\mathrm{LHS}'(0) \le \mathrm{RHS}'(0)$.
(b) Compute $\mathrm{LHS}'(0) = -2\,\mathbf{I}[f]$. (Hint: Pass through the Fourier
representation; cf. Exercise 2.18.)
(c) Compute $\mathrm{RHS}'(0) = -\mathbf{Ent}[f^2]$, thereby deducing the Log-Sobolev
Inequality. (Hint: As an intermediate step, define $F(t) =
\mathbf{E}[|f|^{1 + \exp(-2t)}]$ and show that $\mathrm{RHS}'(0) = F(0) \ln F(0) + F'(0)$.)
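The Log-Sobolev Inequality can be tested by brute force on small cubes. The sketch below assumes the uniform distribution on $\{-1,1\}^n$ and the usual formula $\mathbf{I}[f] = \sum_i \mathbf{E}[(\mathrm{D}_i f)^2]$ with $\mathrm{D}_i f(x) = \tfrac12\big(f(x^{(i \mapsto 1)}) - f(x^{(i \mapsto -1)})\big)$. A function is stored as a dictionary keyed by points of the cube, and the helper names are illustrative. Passing random trials is evidence, not a proof.

```python
import itertools
import math
import random


def phi(t: float) -> float:
    """Phi(t) = t ln t, with 0 ln 0 = 0."""
    return 0.0 if t == 0 else t * math.log(t)


def entropy(g: dict) -> float:
    """Ent[g] under the uniform distribution on the keys of g."""
    values = list(g.values())
    mean = sum(values) / len(values)
    return sum(phi(v) for v in values) / len(values) - phi(mean)


def total_influence(f: dict, n: int) -> float:
    """I[f] = sum_i E[(D_i f)^2]; (D_i f(x))^2 = ((f(x) - f(x with bit i flipped)) / 2)^2."""
    total = 0.0
    for x in f:
        for i in range(n):
            flipped = x[:i] + (-x[i],) + x[i + 1:]
            total += ((f[x] - f[flipped]) / 2) ** 2
    return total / len(f)


def log_sobolev_sides(f: dict, n: int):
    """Return (Ent[f^2] / 2, I[f])."""
    f_squared = {x: v * v for x, v in f.items()}
    return 0.5 * entropy(f_squared), total_influence(f, n)


n = 4
cube = list(itertools.product((-1, 1), repeat=n))
rng = random.Random(0)
worst_slack = min(
    rhs - lhs
    for lhs, rhs in (
        log_sobolev_sides({x: rng.uniform(-2, 2) for x in cube}, n)
        for _ in range(200)
    )
)
print("smallest observed I[f] - Ent[f^2]/2:", worst_slack)
```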
10.24 (a) Let $f : \{-1, 1\}^n \to \mathbb{R}$. Show that $\mathbf{Ent}[(1 + \epsilon f)^2] \sim 2\,\mathbf{Var}[f]\,\epsilon^2$ as
$\epsilon \to 0$.
(b) Deduce the Poincaré Inequality for $f$ from the Log-Sobolev Inequality.
10.25 (a) Deduce from the Log-Sobolev Inequality that for $f : \{-1, 1\}^n \to
\{-1, 1\}$ with $\alpha = \min\{\Pr[f = 1], \Pr[f = -1]\}$,

$$2\alpha \ln(1/\alpha) \le \mathbf{I}[f]. \qquad (10.37)$$

This is off by a factor of $\ln 2$ from the optimal edge-isoperimetric
inequality Theorem 2.39. (Hint: Apply the inequality to either $\tfrac12 - \tfrac12 f$
or $\tfrac12 + \tfrac12 f$.)

(b) Give a more streamlined direct derivation of (10.37) by differentiating
the Small-Set Expansion Theorem.

10.26 This exercise gives a direct proof of the Log-Sobolev Inequality.

(a) The first step is to establish the $n = 1$ case. Toward this, show that
we may assume $f : \{-1, 1\} \to \mathbb{R}$ is nonnegative and has mean 1.
(Hints: Exercise 2.14, Exercise 10.22(b).)

(b) Thus it remains to establish $\tfrac12 \mathbf{Ent}[(1 + bx)^2] \le b^2$ for $b \in [-1, 1]$.
Show that $g(b) = b^2 - \tfrac12 \mathbf{Ent}[(1 + bx)^2]$ is continuous on $[-1, 1]$ and
satisfies $g(0) = 0$, $g'(0) = 0$, and $g''(b) = \frac{2b^2}{1 + b^2} + \ln\frac{1 + b^2}{1 - b^2} \ge 0$ for
$b \in (-1, 1)$. Explain why this completes the proof of the $n = 1$ case
of the Log-Sobolev Inequality.

(c) Show that for any two functions $f_+, f_- : \{-1, 1\}^n \to \mathbb{R}$,

$$\left(\frac{\sqrt{\mathbf{E}[f_+^2]} - \sqrt{\mathbf{E}[f_-^2]}}{2}\right)^2 \le \mathbf{E}\left[\left(\frac{f_+ - f_-}{2}\right)^2\right].$$

(Hint: The triangle inequality for $\|\cdot\|_2$.)

(d) Prove the Log-Sobolev Inequality via “induction by restrictions” (as
described in Section 9.4). (Hint: For the right-hand side, establish
$\mathbf{I}[f] = \mathbf{E}\big[\big(\tfrac{f_+ - f_-}{2}\big)^2\big] + \tfrac12 \mathbf{I}[f_+] + \tfrac12 \mathbf{I}[f_-]$. For the left-hand side,
apply induction, then the $n = 1$ base case, then part (c).)
10.27 (a) By following the strategy of Exercise 10.23, establish the
following:
Log-Sobolev Inequality for general product space domains. Let
$f \in L^2(\Omega^n, \pi^{\otimes n})$ and write $\lambda = \min(\pi)$, $\lambda' = 1 - \lambda$, $\exp(-u) = \frac{\lambda}{\lambda'}$.
Then $\tfrac12 \varrho\, \mathbf{Ent}[f^2] \le \mathbf{I}[f]$, where

$$\varrho = \varrho(\lambda) = \frac{\tanh(u/2)}{u/2} = 2\,\frac{\lambda' - \lambda}{\ln \lambda' - \ln \lambda}.$$

(b) Show that $\varrho(\lambda) \sim 2/\ln(1/\lambda)$ as $\lambda \to 0$.

(c) Let $f : \{-1, 1\}^n \to \{-1, 1\}$ and treat $\{-1, 1\}^n$ as having the
$p$-biased distribution $\pi_p^{\otimes n}$. Write $q = 1 - p$. Show that if $\alpha =
\min\{\Pr_{\pi_p}[f = 1], \Pr_{\pi_p}[f = -1]\}$, then

$$4\,\frac{q - p}{\ln q - \ln p}\,\alpha \ln(1/\alpha) \le \mathbf{I}[f],$$

and hence, for $p \to 0$,

$$\alpha \log_p \alpha \le (1 + o_p(1))\, p \cdot \mathop{\mathbf{E}}_{x \sim \pi_p^{\otimes n}}[\mathrm{sens}_f(x)]. \qquad (10.38)$$

We remark that (10.38) is known to hold without the $o_p(1)$ for all
$p \le 1/2$.
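Two closed forms in Exercises 10.26(b) and 10.27(a) are easy to check numerically. The sketch below makes three assumptions: $x$ is a uniform $\pm 1$ bit in 10.26(b), the second-derivative formula is $g''(b) = \frac{2b^2}{1+b^2} + \ln\frac{1+b^2}{1-b^2}$, and the Log-Sobolev constant is $\varrho(\lambda) = \tanh(u/2)/(u/2)$ with $e^{-u} = \lambda/(1-\lambda)$. Function names are illustrative. It compares $g''$ with a central finite difference and compares the two expressions for $\varrho$.

```python
import math


def phi(t: float) -> float:
    """Phi(t) = t ln t, with 0 ln 0 = 0."""
    return 0.0 if t == 0 else t * math.log(t)


def g(b: float) -> float:
    """g(b) = b^2 - Ent[(1 + b x)^2] / 2 for a uniform +-1 bit x."""
    squares = [(1 + b) ** 2, (1 - b) ** 2]
    ent = 0.5 * sum(phi(s) for s in squares) - phi(1 + b * b)
    return b * b - 0.5 * ent


def g_second_formula(b: float) -> float:
    """Claimed closed form for g''(b) on (-1, 1)."""
    return 2 * b * b / (1 + b * b) + math.log((1 + b * b) / (1 - b * b))


def second_difference(fn, b: float, h: float = 1e-4) -> float:
    """Central finite-difference approximation to fn''(b)."""
    return (fn(b + h) - 2 * fn(b) + fn(b - h)) / h**2


def varrho_tanh(lam: float) -> float:
    u = math.log((1 - lam) / lam)
    return math.tanh(u / 2) / (u / 2)


def varrho_closed(lam: float) -> float:
    return 2 * ((1 - lam) - lam) / (math.log(1 - lam) - math.log(lam))


for b in (0.3, 0.7):
    print(b, second_difference(g, b), g_second_formula(b))
for lam in (0.01, 0.1, 0.3, 0.49):
    print(lam, varrho_tanh(lam), varrho_closed(lam))
# Part (b) of 10.27: varrho(lam) * ln(1/lam) / 2 should approach 1.
print(varrho_closed(1e-9) * math.log(1e9) / 2)
```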
10.28 Prove Theorem 10.21. (Hint: Recall Proposition 8.28.) 


10.29 Let $X_1, \dots, X_n$ be independent $(2, q, \rho)$-hypercontractive random
variables and let $F(x) = \sum_{|S| \le k} \widehat{F}(S)\, x^S$ be an $n$-variate multilinear
polynomial of degree at most $k$. Show that

$$\|F(X_1, \dots, X_n)\|_q \le (1/\rho)^k\, \|F(X_1, \dots, X_n)\|_2.$$

(Hint: You’ll need Exercise 10.3.)

10.30 Let $0 < \lambda < 1/2$ and let $(\Omega, \pi)$ be a finite probability space in which
some outcome $\omega_0 \in \Omega$ has $\pi(\omega_0) = \lambda$. (For example, $\Omega = \{-1, 1\}$, $\pi =
\pi_\lambda$.) Define $f \in L^2(\Omega, \pi)$ by setting $f(\omega_0) = 1$, $f(\omega) = 0$ for $\omega \ne \omega_0$.
For $q \ge 2$, compute $\|f\|_q / \|f\|_2$ and deduce (in light of the proof of
Theorem 10.21) that Corollary 10.20 cannot hold for $\rho > \lambda^{1/2 - 1/q}$.

10.31 Prove Theorem 10.22. 

10.32 Prove Theorem 10.23. 

10.33 Prove Theorem 10.24. (Hint: Immediately worsen q — 1 to q so that 
finding the optimal choice of q is easier.) 

10.34 Prove Theorem 10.25. 

10.35 Prove Friedgut’s Junta Theorem for general product spaces as stated in 
Section 10.3. 


10.36 Show that (10.9) implies $F(p_c + \eta p_c) \ge 1 - \epsilon$ in the proof of
Theorem 10.29. (Hint: Consider $\frac{d}{dp} \ln(1 - F(p))$.)

10.37 Justify the various calculations and observations in Example 10.45.

10.38 (a) Let $p = \frac{1}{n}$ and let $f \in L^2(\{-1, 1\}^n, \pi_p^{\otimes n})$ be any Boolean-valued
function. Show that $\mathbf{I}[f] \le 4$. (Hint: Proposition 8.45.)

(b) Let us specialize to the case $f = \chi_{[n]}$. Show that $f$ is not .1-close to
any width-$O(1)$ DNF (under the $\frac{1}{n}$-biased distribution, for $n$
sufficiently large). This shows that the assumption of monotonicity can’t
be removed from Friedgut’s Conjecture. (Hint: Show that fixing any
constant number of coordinates cannot change the bias of $\chi_{[n]}$ very
much.)


10.39 A function $h : \Omega^n \to \Sigma$ is said to be expressed as a pseudo-junta if the
following hold: There are “juntas” $f_1, \dots, f_m : \Omega^n \to \{\mathsf{True}, \mathsf{False}\}$ with
domains $J_1, \dots, J_m \subseteq [n]$ respectively. Further, $g : (\Omega \cup \{*\})^n \to \Sigma$,
where $*$ is a new symbol not in $\Omega$. Finally, for each input $x \in \Omega^n$ we
have $h(x) = g(y)$, where for $j \in [n]$,

$$y_j = \begin{cases} x_j & \text{if } j \in J_i \text{ for some } i \text{ with } f_i(x) = \mathsf{True}, \\ * & \text{else.} \end{cases}$$

An alternative explanation is that on input $x$, the junta $f_i$ decides whether
the coordinates in its domain are “notable”; then, $h(x)$ must be determined
based only on the set of all notable coordinates. Finally, if $\pi$ is a
distribution on $\Omega$, we say that the pseudo-junta has width-$k$ under $\pi^{\otimes n}$ if

$$\mathop{\mathbf{E}}_{x \sim \pi^{\otimes n}}\big[\#\{j : y_j \ne *\}\big] \le k;$$

in other words, the expected number of notable coordinates is at
most $k$. For $h \in L^2(\Omega^n, \pi^{\otimes n})$ we simply say that $h$ is a $k$-pseudo-junta.
Show that if such a $k$-pseudo-junta $h$ is $\{-1, 1\}$-valued,
then $\mathbf{I}[h] \le 4k$. (Hint: Referring to the second statement in
Proposition 8.24, consider the notable coordinates for both $x$ and
$x^{(i \mapsto x'_i)} = (x_1, \dots, x_{i-1}, x'_i, x_{i+1}, \dots, x_n)$.)

10.40 Establish the following further consequence of Bourgain’s Sharp
Threshold Theorem: Let $f : \{\mathsf{True}, \mathsf{False}\}^n \to \{\mathsf{True}, \mathsf{False}\}$ be a
monotone function with $\mathbf{I}[f] \le K$. Assume $\mathbf{Var}[f] \ge .01$ and
$0 < p < \exp(-cK^2)$, where $c$ is a large universal constant. Then there
exists $T \subseteq [n]$ with $|T| \le O(K)$ such that

$$\Pr_{x \sim \pi_p^{\otimes n}}[f(x) = \mathsf{True} \mid x_i = \mathsf{True} \text{ for all } i \in T] \ge \Pr_{x \sim \pi_p^{\otimes n}}[f(x) = \mathsf{True}] + \exp(-O(K^2)).$$

(Hint: Bourgain’s Sharp Threshold Theorem yields a booster either
toward True or toward False. In the former case you’re easily done; to
rule out the latter case, use the fact that $p|T| \ll \exp(-O(K^2))$.)
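10.41 Suppose that in Bourgain’s Sharp Threshold Theorem we drop
the assumption that $\mathbf{Var}[f] \ge .01$. (Assume at least that $f$ is
nonconstant.) Show that there is some $\tau$ with $|\tau| \ge \mathbf{stddev}[f] \cdot
\exp(-O(\mathbf{I}[f]^2 / \mathbf{Var}[f]^2))$ such that

$$\Pr_{x \sim \pi^{\otimes n}}\Big[\exists\, T \subseteq [n],\ |T| \le O\Big(\frac{\mathbf{I}[f]}{\mathbf{Var}[f]}\Big) \text{ such that } x_T \text{ is a } \tau\text{-booster}\Big] \ge |\tau|.$$

(Cf. Exercise 9.32.)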


10.42 In this exercise we give the beginnings of the idea of how Bourgain’s
Sharp Threshold Theorem can be used to show sharp thresholds for
interesting monotone properties. We will consider $\neg 3\mathrm{Col}$, the property
of a random $v$-vertex graph $G \sim \mathcal{G}(v, p)$ being non-3-colorable.

(a) Prove that the critical probability $p_c$ satisfies $p_c \le O(1/v)$;
i.e., establish that there is a universal constant $C$ such that
$\Pr[G \sim \mathcal{G}(v, C/v) \text{ is 3-colorable}] = o_v(1)$. (Hint: Union-bound
over all potential 3-colorings.)

(b) Toward showing (non-)3-colorability has a sharp threshold, suppose
the property had constant total influence at the critical probability.
Bourgain’s Sharp Threshold Theorem would imply that there is a $\tau$ of
constant magnitude such that for $G \sim \mathcal{G}(v, p_c)$, there is a $|\tau|$ chance
that $G$ contains a $\tau$-boosting induced subgraph $G_T$. There are two
cases, depending on the sign of $\tau$. It’s easy to rule out that the boost
is in favor of 3-colorability; the absence of a few edges shouldn’t
increase the probability of 3-colorability by much (cf.
Exercise 10.41). On the other hand, it might seem plausible that the
presence of a certain constant number of edges should boost the
probability of non-3-colorability by a lot. For example, the presence of a
4-clique immediately boosts the probability to 1. However, the point
is that at the critical probability it is very unlikely that $G$ contains a
4-clique (or indeed, any “local” witness to non-3-colorability). Short
of showing this, prove at least that the expected number of 4-cliques
in $G \sim \mathcal{G}(v, p)$ is $o_v(1)$ unless $p = \Omega(v^{-2/3}) \gg p_c$.
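The last claim of Exercise 10.42(b) rests on the first-moment formula: the expected number of 4-cliques in $\mathcal{G}(v, p)$ is $\binom{v}{4} p^6$, since each of the $\binom{v}{4}$ vertex quadruples forms a clique with probability $p^6$. A minimal sketch (the function name is illustrative, and the constant 5 in $p = 5/v$ is an arbitrary choice) shows the two regimes:

```python
import math


def expected_4_cliques(v: int, p: float) -> float:
    """E[# of 4-cliques in G(v, p)] = C(v, 4) * p^6 (6 edges per 4-clique)."""
    return math.comb(v, 4) * p**6


v = 10**6
# At p = C/v the expectation is of order v^4 / v^6 and vanishes.
print(expected_4_cliques(v, 5 / v))
# At p = v^(-2/3) it is of constant order, roughly 1/24.
print(expected_4_cliques(v, v ** (-2 / 3)))
```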


Notes 


As mentioned, the standard template introduced by Bonami (Bonami, 1970) for proving
the Hypercontractivity Theorem for $\pm 1$ bits is to first prove the Two-Point Inequality, and
then do the induction described in Exercise 10.3. Bonami’s original proof of the Two-
Point Inequality reduced to the $1 \le p < q \le 2$ case as we did, but then her calculus was a
little more cumbersome. We followed the proof of the Two-Point Inequality appearing in 
Janson (Janson, 1997). Our use of two-function hypercontractivity theorems to facilitate 
induction and avoid the use of Exercise 10.1 is nontraditional; it was inspired by Mossel 
et al. (Mossel et al., 2006), Barak et al. (Barak et al., 2012), and Kauers et al. (Kauers 
et al., 2013). The other main approach for proving the Hypercontractivity Theorem 
is to derive it from the Log-Sobolev Inequality (see Exercise 10.23), as was done by 
Gross (Gross, 1975). 

We are not aware of the Generalized Small-Set Expansion Theorem appearing previ- 
ously in the literature; however, in a sense it’s almost identical to the Reverse Small-Set 
Expansion Theorem, which is due to Mossel et al. (Mossel et al., 2006). The Reverse 
Hypercontractivity Inequality itself is due to Borell (Borell, 1982); the presentation in 
Exercises 10.6-10.9 follows Mossel et al. (Mossel et al., 2006). For more on reverse 
hypercontractivity, including the very surprising fact that the Reverse Hypercontractiv- 
ity Inequality holds with no change in constants for every product probability space, see 
Mossel, Oleszkiewicz, and Sen (Mossel et al., 2012). 

As mentioned in Chapter 9 the definition of a hypercontractive random variable
is due to Krakowiak and Szulga (Krakowiak and Szulga, 1988). Many of the basic
facts from Section 10.2 (and also Exercise 10.2) are from this work and the earlier
work of Borell (Borell, 1984); see also various other works (Kwapień and
Woyczyński, 1992; Janson, 1997; Szulga, 1998; Mossel et al., 2010). As mentioned, the
main part of Theorem 10.18 (the case of biased bits) is essentially from Latała and
Oleszkiewicz (Latała and Oleszkiewicz, 1994); see also Oleszkiewicz (Oleszkiewicz,
2003). Our Exercise 10.20 fleshes out (and slightly simplifies) their computations but
introduces no new idea. Earlier works (Bourgain et al., 1992; Talagrand, 1994; Friedgut
and Kalai, 1996; Friedgut, 1998) had established forms of the General
Hypercontractivity Theorem for $\lambda$-biased bits, giving as applications KKL-type theorems in
this setting with the correct asymptotic dependence on $\lambda$. We should also mention
that the sharp Log-Sobolev Inequality for product space domains (mentioned in
Exercise 10.27) was derived independently of the Latała-Oleszkiewicz work by Higuchi
and Yoshida (Higuchi and Yoshida, 1995) (without proof), by Diaconis and
Saloff-Coste (Diaconis and Saloff-Coste, 1996) (with proof), and possibly also by Oscar
Rothaus (see (Bobkov and Ledoux, 1998)). Unlike in the case of uniform $\pm 1$ bits, it’s
not known how to derive Latała and Oleszkiewicz’s optimal biased hypercontractive
inequality from the optimal biased Log-Sobolev Inequality.

Kahane (Kahane, 1968) has been credited with pioneering the randomiza- 
tion/symmetrization trick for random variables. The entirety of Section 10.4 is due 
to Bourgain (Bourgain, 1979), though our presentation was significantly informed by 
the expertise of Krzysztof Oleszkiewicz (and our proof of Lemma 10.43 is slightly 
different). Like Bourgain, we don’t give any explicit dependence for the constant $C_q$
in Theorem 10.39; however, Kwapień (Kwapień, 2010) has shown that one may take
$C_{q'} = C_q = O(q/\log q)$ for $q \ge 2$. Our proof of Bourgain’s Theorem 10.47 follows the
original (Bourgain, 1999) extremely closely, though we also valued the easier-to-read 
version of Bal (Bal, 2013). 

The biased edge-isoperimetric inequality (10.38) from Exercise 10.27 was proved
by induction on $n$, without the additional $o_p(1)$ error, by Russo (Russo, 1982) (and
also independently by Kahn and Kalai (Kahn and Kalai, 2007)). We remark that this
work and the earlier (Russo, 1981) already contain the germ of the idea that monotone
functions with small influences have sharp thresholds. Regarding the sharp threshold
for 3-colorability discussed in Exercise 10.42, Alon and Spencer (Alon and Spencer,
2008) contains a nice elementary proof of the fact that at the critical probability for
3-colorability, every subgraph on $\epsilon v$ vertices is 3-colorable, for some universal $\epsilon > 0$.
The existence of a sharp threshold for k-colorability was proven by Achlioptas and 
Friedgut (Achlioptas and Friedgut, 1999), with Achlioptas and Naor (Achlioptas and 
Naor, 2005) essentially determining the location. 


11 


Gaussian Space and Invariance Principles 


The final destination of this chapter is a proof of the following theorem due
to Mossel, O’Donnell, and Oleszkiewicz (Mossel et al., 2005b, 2010), first
mentioned in Chapter 5.2:


Majority Is Stablest Theorem. Fix $\rho \in (0, 1)$. Let $f : \{-1, 1\}^n \to [-1, 1]$
have $\mathbf{E}[f] = 0$. Then, assuming $\mathbf{MaxInf}[f] \le \epsilon$, or more generally that $f$ has
no $(\epsilon, \epsilon)$-notable coordinates,

$$\mathbf{Stab}_\rho[f] \le 1 - \tfrac{2}{\pi} \arccos \rho + o_\epsilon(1).$$

This bound is tight; recalling Theorem 2.45, the bound $1 - \tfrac{2}{\pi} \arccos \rho$ is
achieved by taking $f = \mathrm{Maj}_n$, the volume-$\tfrac12$ Hamming ball indicator, for
$n \to \infty$. More generally, in Section 11.7 we’ll prove the General-Volume
Majority Is Stablest Theorem, which shows that for any fixed volume,
“Hamming ball indicators have maximal noise stability among small-influence
functions”.

There are two main ideas underlying this theorem. The first is that “functions 
on Gaussian space” are a special case of small-influence Boolean functions. 
In other words, a Boolean function may always be a “Gaussian function in 
disguise”. This motivates analysis of Gaussian functions, the topic introduced 
in Sections 11.1 and 11.2. It also means that a prerequisite for proving the 
(General-Volume) Majority Is Stablest Theorem is proving its Gaussian special 
cases, namely, Borell’s Isoperimetric Theorem (Section 11.3) and the Gaussian
Isoperimetric Inequality (Section 11.4). In many ways, working in the Gaussian 
setting is nicer because tools like rotational symmetry and differentiation are 
available. 

The second idea is the converse to the first: In Section 11.6 we prove
the Invariance Principle, a generalization of the Berry-Esseen Central Limit
Theorem, which shows that any low-degree (or uniformly noise-stable) Boolean
function with small influences is approximable by a Gaussian function. In fact,
the Invariance Principle roughly shows that given such a Boolean function, if 
you plug any independent mean-0, variance-1 random variables into its Fourier 
expansion, the distribution doesn’t change much. In Section 11.7 we use the 
Invariance Principle to prove the Majority Is Stablest Theorem by reducing to 
its Gaussian special case, Borell’s Isoperimetric Theorem. 


11.1. Gaussian Space and the Gaussian Noise Operator 
We begin with a few definitions concerning Gaussian space. 


Notation 11.1. Throughout this chapter we write $\varphi$ for the pdf of a standard
Gaussian random variable, $\varphi(z) = \frac{1}{\sqrt{2\pi}} \exp(-\tfrac12 z^2)$. We also write $\Phi$ for its
cdf, and $\overline{\Phi}$ for the complementary cdf $\overline{\Phi}(t) = 1 - \Phi(t) = \Phi(-t)$. We write
$z \sim \mathrm{N}(0, 1)^n$ to denote that $z = (z_1, \dots, z_n)$ is a random vector in $\mathbb{R}^n$ whose
components $z_i$ are independent Gaussians. Perhaps the most important property
of this distribution is that it’s rotationally symmetric; this follows because the
pdf at $z$ is $\frac{1}{(2\pi)^{n/2}} \exp(-\tfrac12 (z_1^2 + \cdots + z_n^2))$, which depends only on the length $\|z\|_2^2$
of $z$.


Definition 11.2. For $n \in \mathbb{N}^+$ and $1 \le p \le \infty$ we write $L^p(\mathbb{R}^n, \gamma)$ for the space
of Borel functions $f : \mathbb{R}^n \to \mathbb{R}$ that have finite $p$th moment $\|f\|_p^p$ under the
Gaussian measure (the “$\gamma$” stands for Gaussian). Here for a function $f$ on
Gaussian space we use the notation

$$\|f\|_p = \mathop{\mathbf{E}}_{z \sim \mathrm{N}(0, 1)^n}\big[|f(z)|^p\big]^{1/p}.$$

All functions $f : \mathbb{R}^n \to \mathbb{R}$ and sets $A \subseteq \mathbb{R}^n$ are henceforth assumed to be Borel
without further mention.


Notation 11.3. When it’s clear from context that $f$ is a function on Gaussian
space we’ll use shorthand notation like $\mathbf{E}[f] = \mathbf{E}_{z \sim \mathrm{N}(0,1)^n}[f(z)]$. If $f = 1_A$ is
the 0-1 indicator of a subset $A \subseteq \mathbb{R}^n$ we’ll also write

$$\mathrm{vol}_\gamma(A) = \mathbf{E}[1_A] = \Pr_{z \sim \mathrm{N}(0,1)^n}[z \in A]$$

for the Gaussian volume of $A$.


Notation 11.4. For $f, g \in L^2(\mathbb{R}^n, \gamma)$ we use the inner product notation
$\langle f, g \rangle = \mathbf{E}[fg]$, under which $L^2(\mathbb{R}^n, \gamma)$ is a separable Hilbert space.
If you’re only interested in Boolean functions $f : \{-1, 1\}^n \to \{-1, 1\}$ you
might wonder why it’s necessary to study Gaussian space. As discussed at the
beginning of the chapter, the reason is that functions on Gaussian space are
special cases of Boolean functions. Conversely, even if you’re only interested in
studying functions of Gaussian random variables, sometimes the easiest proof
technique involves “simulating” the Gaussians using sums of random bits. Let’s
discuss this in a little more detail. Recall that the Central Limit Theorem tells
us that for $x \sim \{-1, 1\}^M$, the distribution of $\frac{1}{\sqrt{M}}(x_1 + \cdots + x_M)$ approaches
that of a standard Gaussian as $M \to \infty$. This is the sense in which a standard
Gaussian random variable $z \sim \mathrm{N}(0, 1)$ can be “simulated” by random bits. If
we want $d$ independent Gaussians we can simulate them by summing up $M$
independent $d$-dimensional vectors of random bits.


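Definition 11.5. The function $\mathrm{BitsToGaussians}_M : \{-1, 1\}^M \to \mathbb{R}$ is defined
by

$$\mathrm{BitsToGaussians}_M(x) = \tfrac{1}{\sqrt{M}}(x_1 + \cdots + x_M).$$

More generally, the function $\mathrm{BitsToGaussians}_M^d : \{-1, 1\}^{dM} \to \mathbb{R}^d$ is defined
on an input $x \in \{-1, 1\}^{d \times M}$, thought of as a matrix of column vectors
$\vec{x}_1, \dots, \vec{x}_M \in \{-1, 1\}^d$, by

$$\mathrm{BitsToGaussians}_M^d(x) = \tfrac{1}{\sqrt{M}}(\vec{x}_1 + \cdots + \vec{x}_M).$$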
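Definition 11.5 is short enough to implement directly. The following sketch uses only the Python standard library; the function names, the choice $M = 64$, and the sample count are illustrative. The empirical mean and variance of the simulated variable should be close to 0 and 1, in line with the Central Limit Theorem heuristic.

```python
import math
import random


def bits_to_gaussian(bits) -> float:
    """BitsToGaussians_M: map x in {-1,1}^M to (x_1 + ... + x_M) / sqrt(M)."""
    return sum(bits) / math.sqrt(len(bits))


def bits_to_gaussians(columns) -> list:
    """BitsToGaussians_M^d: `columns` is a list of M column vectors in {-1,1}^d;
    returns their sum scaled by 1/sqrt(M), a vector in R^d."""
    m, d = len(columns), len(columns[0])
    return [sum(col[i] for col in columns) / math.sqrt(m) for i in range(d)]


rng = random.Random(0)
M, num_samples = 64, 4000
samples = [
    bits_to_gaussian([rng.choice((-1, 1)) for _ in range(M)])
    for _ in range(num_samples)
]
mean = sum(samples) / num_samples
variance = sum((s - mean) ** 2 for s in samples) / num_samples
print(mean, variance)  # close to 0 and 1 respectively
```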


Although $M$ needs to be large for this simulation to be accurate, many of the
results we’ve developed in the analysis of Boolean functions $f : \{-1, 1\}^M \to \mathbb{R}$
are independent of $M$. A further key point is that this simulation preserves
polynomial degree: if $p(z_1, \dots, z_d)$ is a degree-$k$ polynomial applied to $d$
independent standard Gaussians, the “simulated version” $p \circ \mathrm{BitsToGaussians}_M^d :
\{-1, 1\}^{dM} \to \mathbb{R}$ is a degree-$k$ Boolean function. These facts allow us to transfer
many results from the analysis of Boolean functions to the analysis of Gaussian
functions. On the other hand, it also means that to fully understand Boolean
functions, we need to understand the “special case” of functions on Gaussian
space: a Boolean function may essentially be a function on Gaussian space “in
disguise”. For example, as we saw in Chapter 5.3, there is a sense in which the
majority function $\mathrm{Maj}_n$ “converges” as $n \to \infty$; what it’s converging to is the
sign function on 1-dimensional Gaussian space, $\mathrm{sgn} \in L^1(\mathbb{R}, \gamma)$.

We’ll begin our study of Gaussian functions by developing the analogue
of the most important operator on Boolean functions, namely the noise
operator $T_\rho$. Suppose we take a pair of $\rho$-correlated $M$-bit strings $(x, x')$ and use
them to form approximate Gaussians,

$$y = \mathrm{BitsToGaussians}_M(x), \qquad y' = \mathrm{BitsToGaussians}_M(x').$$

For each $M$ it’s easy to compute that $\mathbf{E}[y] = \mathbf{E}[y'] = 0$, $\mathbf{Var}[y] = \mathbf{Var}[y'] = 1$,
and $\mathbf{E}[yy'] = \rho$. As noted in Chapter 5.2, a multidimensional version of the
Central Limit Theorem (see, e.g., Exercises 5.33, 11.46) tells us that the joint
distribution of $(y, y')$ converges to a pair of Gaussian random variables with
the same properties. We call these $\rho$-correlated Gaussians.


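Definition 11.6. For $-1 \le \rho \le 1$, we say that the random variables $(z, z')$
are $\rho$-correlated (standard) Gaussians if they are jointly Gaussian and satisfy
$\mathbf{E}[z] = \mathbf{E}[z'] = 0$, $\mathbf{Var}[z] = \mathbf{Var}[z'] = 1$, and $\mathbf{E}[zz'] = \rho$. In other words, if

$$\begin{bmatrix} z \\ z' \end{bmatrix} \sim \mathrm{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right).$$

Note that the definition is symmetric in $z$, $z'$ and that each is individually
distributed as $\mathrm{N}(0, 1)$.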


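Fact 11.7. An equivalent definition is to say that $z = \langle \vec{u}, \vec{g} \rangle$ and $z' = \langle \vec{v}, \vec{g} \rangle$,
where $\vec{g} \sim \mathrm{N}(0, 1)^d$ and $\vec{u}, \vec{v} \in \mathbb{R}^d$ are any two unit vectors satisfying $\langle \vec{u}, \vec{v} \rangle = \rho$.
In particular we may choose $d = 2$, $\vec{u} = (1, 0)$, and $\vec{v} = (\rho, \sqrt{1 - \rho^2})$, thereby
defining $z = g_1$ and $z' = \rho g_1 + \sqrt{1 - \rho^2}\, g_2$.


Remark 11.8. In Fact 11.7 it’s often convenient to write $\rho = \cos\theta$ for some
$\theta \in \mathbb{R}$, in which case we may define the $\rho$-correlated Gaussians as $z = \langle \vec{u}, \vec{g} \rangle$
and $z' = \langle \vec{v}, \vec{g} \rangle$ for any unit vectors $\vec{u}, \vec{v}$ making an angle of $\theta$; e.g., $\vec{u} = (1, 0)$,
$\vec{v} = (\cos\theta, \sin\theta)$.


Definition 11.9. For a fixed $z \in \mathbb{R}$ we say random variable $z'$ is a Gaussian
$\rho$-correlated to $z$, written $z' \sim \mathrm{N}_\rho(z)$, if $z'$ is distributed as $\rho z + \sqrt{1 - \rho^2}\, g$ where
$g \sim \mathrm{N}(0, 1)$. By Fact 11.7, if we draw $z \sim \mathrm{N}(0, 1)$ and then form $z' \sim \mathrm{N}_\rho(z)$,
we obtain a $\rho$-correlated pair of Gaussians $(z, z')$.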
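The construction in Fact 11.7 and Definition 11.9 gives an immediate sampler for $\rho$-correlated Gaussians. The sketch below relies on `random.Random.gauss` from the Python standard library; the function name, the sample size, and the value $\rho = 0.6$ are illustrative choices. The empirical moments should be near the values required by Definition 11.6.

```python
import math
import random


def correlated_pair(rho: float, rng: random.Random):
    """Draw (z, z') with z ~ N(0,1) and z' = rho*z + sqrt(1 - rho^2)*g, g ~ N(0,1)."""
    z = rng.gauss(0.0, 1.0)
    g = rng.gauss(0.0, 1.0)
    return z, rho * z + math.sqrt(1 - rho * rho) * g


rng = random.Random(0)
rho = 0.6
pairs = [correlated_pair(rho, rng) for _ in range(100_000)]
n = len(pairs)
mean_z2 = sum(b for _, b in pairs) / n
var_z2 = sum(b * b for _, b in pairs) / n - mean_z2**2
corr = sum(a * b for a, b in pairs) / n
print(mean_z2, var_z2, corr)  # roughly 0, 1, and rho
```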


Definition 11.10. For $-1 \le \rho \le 1$ and $n \in \mathbb{N}^+$ we say that the $\mathbb{R}^n$-valued
random variables $(z, z')$ are $\rho$-correlated $n$-dimensional Gaussian random vectors
if each component pair $(z_1, z'_1), \dots, (z_n, z'_n)$ is a $\rho$-correlated pair of
Gaussians, and the $n$ pairs are mutually independent. We also naturally extend the
definition of $z' \sim \mathrm{N}_\rho(z)$ to the case of $z \in \mathbb{R}^n$; this means $z' = \rho z + \sqrt{1 - \rho^2}\, \vec{g}$
for $\vec{g} \sim \mathrm{N}(0, 1)^n$.


Remark 11.11. Thus, if $z \sim \mathrm{N}(0, 1)^n$ and then $z' \sim \mathrm{N}_\rho(z)$ we obtain a
$\rho$-correlated $n$-dimensional pair $(z, z')$. It follows from this that the joint
distribution of such a pair is rotationally symmetric (since the distribution of a single
$n$-dimensional Gaussian is).


Now we can introduce the Gaussian analogue of the noise operator.
Definition 11.12. For $\rho \in [-1, 1]$, the Gaussian noise operator $U_\rho$ is the linear
operator defined on the space of functions $f \in L^1(\mathbb{R}^n, \gamma)$ by

$$U_\rho f(z) = \mathop{\mathbf{E}}_{z' \sim \mathrm{N}_\rho(z)}[f(z')] = \mathop{\mathbf{E}}_{g \sim \mathrm{N}(0,1)^n}\big[f(\rho z + \sqrt{1 - \rho^2}\, g)\big].$$


Fact 11.13. (Exercise 11.3.) If $f \in L^1(\mathbb{R}^n, \gamma)$ is an $n$-variate multilinear
polynomial, then $U_\rho f(z) = f(\rho z)$.
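Definition 11.12 suggests a simple Monte Carlo estimator for $U_\rho f(z)$, which can be used to spot-check Fact 11.13. In the sketch below the estimator name, the sample count, and the test polynomial $f(z) = z_1 z_2 + z_1$ are all illustrative choices. The estimate is random, so it agrees with $f(\rho z)$ only up to sampling error.

```python
import math
import random


def gaussian_noise_operator(f, rho: float, z, samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of U_rho f(z) = E_g[f(rho*z + sqrt(1 - rho^2)*g)],
    with g a vector of independent standard Gaussians."""
    rng = random.Random(seed)
    scale = math.sqrt(1 - rho * rho)
    total = 0.0
    for _ in range(samples):
        noisy = [rho * zi + scale * rng.gauss(0.0, 1.0) for zi in z]
        total += f(noisy)
    return total / samples


def multilinear(z) -> float:
    """A multilinear polynomial in two variables: z_1 z_2 + z_1."""
    return z[0] * z[1] + z[0]


z, rho = [1.0, 2.0], 0.5
estimate = gaussian_noise_operator(multilinear, rho, z)
exact = multilinear([rho * zi for zi in z])  # Fact 11.13 predicts f(rho z)
print(estimate, exact)
```

A non-multilinear function such as $f(z) = z_1^2$ would not satisfy $U_\rho f(z) = f(\rho z)$; there $U_\rho f(z) = \rho^2 z_1^2 + (1 - \rho^2)$.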


Remark 11.14. Our terminology is nonstandard. The Gaussian noise operators
are usually collectively referred to as the Ornstein-Uhlenbeck semigroup (or
sometimes as the Mehler transforms). They are typically defined for $\rho = e^{-t} \in
[0, 1]$ (i.e., for $t \in [0, \infty]$) by

$$P_t f(z) = \mathop{\mathbf{E}}_{g \sim \mathrm{N}(0,1)^n}\big[f(e^{-t} z + \sqrt{1 - e^{-2t}}\, g)\big] = U_{e^{-t}} f(z).$$

The term “semigroup” refers to the fact that the operators satisfy $P_{t_1} P_{t_2} = P_{t_1 + t_2}$,
i.e., $U_{\rho_1} U_{\rho_2} = U_{\rho_1 \rho_2}$ (which holds for all $\rho_1, \rho_2 \in [-1, 1]$; see Exercise 11.4).


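Before going further let’s check that $U_\rho$ is a bounded operator on all of
$L^p(\mathbb{R}^n, \gamma)$ for $p \ge 1$; in fact, it’s a contraction (cf. Exercise 2.33):


Proposition 11.15. For each $\rho \in [-1, 1]$ and $1 \le p \le \infty$ the operator $U_\rho$ is
a contraction on $L^p(\mathbb{R}^n, \gamma)$; i.e., $\|U_\rho f\|_p \le \|f\|_p$.


Proof. The proof for $p = \infty$ is easy; otherwise, the result follows from Jensen’s
inequality, using that $t \mapsto |t|^p$ is convex:

$$\|U_\rho f\|_p^p = \mathop{\mathbf{E}}_{z \sim \mathrm{N}(0,1)^n}\big[|U_\rho f(z)|^p\big] = \mathop{\mathbf{E}}_{z \sim \mathrm{N}(0,1)^n}\Big[\Big|\mathop{\mathbf{E}}_{z' \sim \mathrm{N}_\rho(z)}[f(z')]\Big|^p\Big] \le \mathop{\mathbf{E}}_{z \sim \mathrm{N}(0,1)^n}\Big[\mathop{\mathbf{E}}_{z' \sim \mathrm{N}_\rho(z)}\big[|f(z')|^p\big]\Big] = \|f\|_p^p.$$

As in the Boolean case, you should think of the Gaussian noise operator
as having a “smoothing” effect on functions. As $\rho$ goes from 1 down to 0,
$U_\rho f$ involves averaging $f$’s values over larger and larger neighborhoods. In
particular $U_1$ is the identity operator, $U_1 f = f$, and $U_0 f = \mathbf{E}[f]$, the constant
function. In Exercises 11.5, 11.6 you are asked to verify the following facts,
which say that for any $f$, as $\rho \to 1^-$ we get a sequence of smooth (i.e., $\mathcal{C}^\infty$)
functions $U_\rho f$ that tend to $f$.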


Proposition 11.16. Let $f \in L^1(\mathbb{R}^n, \gamma)$ and let $-1 < \rho < 1$. Then $U_\rho f$ is a
smooth function.
Proposition 11.17. Let $f \in L^1(\mathbb{R}^n, \gamma)$. As $\rho \to 1^-$ we have $\|U_\rho f - f\|_1 \to 0$.


Having defined the Gaussian noise operator, we can also make the natural
definition of Gaussian noise stability (for which we’ll use the same notation as
in the Boolean case):


Definition 11.18. For $f \in L^2(\mathbb{R}^n, \gamma)$ and $\rho \in [-1, 1]$, the Gaussian noise
stability of $f$ at $\rho$ is defined to be

$$\mathbf{Stab}_\rho[f] = \mathop{\mathbf{E}}_{\substack{(z, z')\ n\text{-dimensional} \\ \rho\text{-correlated Gaussians}}}[f(z) f(z')] = \langle f, U_\rho f \rangle = \langle U_\rho f, f \rangle.$$

(Here we used that $(z', z)$ has the same distribution as $(z, z')$ and hence $U_\rho$ is
self-adjoint.)

Example 11.19. Let $f : \mathbb{R} \to \{0, 1\}$ be the 0-1 indicator of the nonpositive
halfline: $f = 1_{(-\infty, 0]}$. Then

$$\mathbf{Stab}_\rho[f] = \mathop{\mathbf{E}}_{(z, z')\ \rho\text{-correlated}}[f(z) f(z')] = \Pr[z \le 0, z' \le 0] = \frac12 - \frac12 \frac{\arccos \rho}{\pi}, \qquad (11.1)$$

with the last equality being Sheppard’s Formula, which we stated in Section 5.2
and now prove.


Proof of Sheppard’s Formula. Since $(-z, -z')$ has the same distribution as
$(z, z')$, proving (11.1) is equivalent to proving

$$\Pr[z \le 0, z' \le 0 \text{ or } z > 0, z' > 0] = 1 - \frac{\arccos \rho}{\pi}.$$

The complement of the above event is the event that $f(z) \ne f(z')$ (up to
measure 0); thus it’s further equivalent to prove

$$\Pr_{(z, z')\ \cos\theta\text{-correlated}}[f(z) \ne f(z')] = \frac{\theta}{\pi} \qquad (11.2)$$

for all $\theta \in [0, \pi]$. As in Remark 11.8, this suggests defining $z = \langle \vec{u}, \vec{g} \rangle$, $z' =
\langle \vec{v}, \vec{g} \rangle$, where $\vec{u}, \vec{v} \in \mathbb{R}^2$ is some fixed pair of unit vectors making an angle of $\theta$,
and $\vec{g} \sim \mathrm{N}(0, 1)^2$. Thus we want to show

$$\Pr_{\vec{g} \sim \mathrm{N}(0,1)^2}\big[\langle \vec{u}, \vec{g} \rangle \le 0 \ \&\ \langle \vec{v}, \vec{g} \rangle > 0, \text{ or vice versa}\big] = \frac{\theta}{\pi}.$$

But this last identity is easy: If we look at the diameter of the unit circle that is
perpendicular to $\vec{g}$, then the event above is equivalent (up to measure 0) to the
event that this diameter “splits” $\vec{u}$ and $\vec{v}$. By the rotational symmetry of $\vec{g}$, the
probability is evidently $\theta$ (the angle between $\vec{u}$, $\vec{v}$) divided by $\pi$ (the range of
angles for the diameter).
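Sheppard’s Formula is easy to test empirically by sampling $\rho$-correlated pairs as in Definition 11.9 and counting how often both coordinates are nonpositive. The sketch below is a Monte Carlo check with illustrative function names and sample size, so agreement holds only up to sampling error.

```python
import math
import random


def sheppard_estimate(rho: float, samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of Pr[z <= 0, z' <= 0] for rho-correlated Gaussians."""
    rng = random.Random(seed)
    scale = math.sqrt(1 - rho * rho)
    hits = 0
    for _ in range(samples):
        z = rng.gauss(0.0, 1.0)
        z_prime = rho * z + scale * rng.gauss(0.0, 1.0)
        if z <= 0 and z_prime <= 0:
            hits += 1
    return hits / samples


def sheppard_formula(rho: float) -> float:
    """Right-hand side of (11.1): 1/2 - arccos(rho) / (2 pi)."""
    return 0.5 - math.acos(rho) / (2 * math.pi)


for rho in (0.5, 0.0, -0.3):
    print(rho, sheppard_estimate(rho), sheppard_formula(rho))
```

For $\rho = 1/2$ the formula gives $\tfrac12 - \tfrac16 = \tfrac13$, and for $\rho = 0$ it gives $\tfrac14$, as expected for independent Gaussians.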


Corollary 11.20. Let $H \subseteq \mathbb{R}^n$ be any halfspace (open or closed) with
boundary hyperplane containing the origin. Let $h = \pm 1_H$. Then $\mathbf{Stab}_\rho[h] =
1 - \tfrac{2}{\pi} \arccos \rho$.


Proof. We may assume $H$ is open (since its boundary has measure 0). By
the rotational symmetry of correlated Gaussians (Remark 11.11), we may
rotate $H$ to the form $H = \{z \in \mathbb{R}^n : z_1 > 0\}$. Then it’s clear that the noise
stability of $h = \pm 1_H$ doesn’t depend on $n$, i.e., we may assume $n = 1$. Thus
$h = \mathrm{sgn} = 1 - 2f$, where $f = 1_{(-\infty, 0]}$ as in Example 11.19. Now if $(z, z')$
denote $\rho$-correlated standard Gaussians, it follows from (11.1) that

$$\mathbf{Stab}_\rho[h] = \mathbf{E}[h(z) h(z')] = \mathbf{E}[(1 - 2f(z))(1 - 2f(z'))]
= 1 - 4\,\mathbf{E}[f] + 4\,\mathbf{Stab}_\rho[f] = 1 - \tfrac{2}{\pi} \arccos \rho.$$


Remark 11.21. The quantity $\mathbf{Stab}_\rho[\mathrm{sgn}] = 1 - \tfrac{2}{\pi} \arccos \rho$ is also precisely
the limiting noise stability of $\mathrm{Maj}_n$, as stated in Theorem 2.45 and justified in
Chapter 5.2.


We’ve defined the key Gaussian noise operator U, and seen (Proposi- 
tion 11.15) that it’s a contraction on all LP (R”, y). Is it also hypercontractive? 
In fact, we’ll now show that the Hypercontractivity Theorem for uniform +1 
bits holds identically in the Gaussian setting. The proof is simply a reduction to 
the Boolean case, and it will use the following standard fact (see Janson (Janson, 
1997, Theorem 2.6) or Teuwen (Teuwen, 2012, Section 1.3) for the proof in 
case of L*; to extend to other L? you can use Exercise 11.1): 


Theorem 11.22. For eachn € N+, the set of multivariate polynomials is dense 
in L?(R", y) forall 1 < p < œ. 


Gaussian Hypercontractivity Theorem. Let $f, g \in L^1(\mathbb{R}^n, \gamma)$, let $r, s \ge 0$,
and assume $0 \le \rho \le \sqrt{rs} \le 1$. Then

$$\langle f, U_\rho g \rangle = \langle U_\rho f, g \rangle = \mathbf{E}_{(z, z')\ \rho\text{-correlated } n\text{-dimensional Gaussians}}[f(z) g(z')] \le \|f\|_{1+r} \|g\|_{1+s}.$$


Proof. (We give a sketch; you are asked to fill in the details in Exercise 11.2.)
We may assume that $f \in L^{1+r}(\mathbb{R}^n, \gamma)$ and $g \in L^{1+s}(\mathbb{R}^n, \gamma)$. We may also
assume $f, g \in L^2(\mathbb{R}^n, \gamma)$ by a truncation and monotone convergence argument;
thus the left-hand side is finite by Cauchy-Schwarz. Finally, we may assume
that $f$ and $g$ are multivariate polynomials, using Theorem 11.22. For fixed
$M \in \mathbb{N}^+$ we consider “simulating” $(z, z')$ using bits. More specifically, let
$(x, x') \in \{-1, 1\}^{nM} \times \{-1, 1\}^{nM}$ be a pair of $\rho$-correlated random strings and
define the joint $\mathbb{R}^n$-valued random variables $y, y'$ by

$$y = \mathrm{BitsToGaussians}_M^n(x), \qquad y' = \mathrm{BitsToGaussians}_M^n(x').$$

By a multidimensional Central Limit Theorem we have that

$$\mathbf{E}[f(y) g(y')] \xrightarrow{M \to \infty} \mathbf{E}_{(z, z')\ \rho\text{-correlated}}[f(z) g(z')].$$

(Since $f$ and $g$ are polynomials, we can even reduce to a Central Limit Theorem
for bivariate monomials.) We further have

$$\mathbf{E}[|f(y)|^{1+r}] \xrightarrow{M \to \infty} \mathbf{E}_{z \sim N(0,1)^n}[|f(z)|^{1+r}]$$

and similarly for $g$. (This can also be proven by the multidimensional Central
Limit Theorem, or by the one-dimensional Central Limit Theorem together
with some tricks.) Thus it suffices to show

$$\mathbf{E}[f(y) g(y')] \le \mathbf{E}[|f(y)|^{1+r}]^{1/(1+r)} \, \mathbf{E}[|g(y')|^{1+s}]^{1/(1+s)}$$

for any fixed $M$. But we can express $f(y) = F(x)$ and $g(y') = G(x')$ for some
$F, G : \{-1, 1\}^{nM} \to \mathbb{R}$ and so the above inequality holds by the Two-Function
Hypercontractivity Theorem (for $\pm 1$ bits).
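To make the “simulation by bits” concrete, here is a small sketch for $n = 1$. It assumes that $\mathrm{BitsToGaussians}_M$ is the normalized sum $(x_1 + \cdots + x_M)/\sqrt{M}$ (its definition appears earlier in the chapter, outside this passage); the helper names, $M = 30$, $\rho = 0.6$, and the seed are illustrative choices.

```python
import math
import random

def correlated_bits(m, rho, rng):
    """Return (x, x') in {-1,1}^m x {-1,1}^m with E[x_i x'_i] = rho, coordinates independent."""
    x, x_prime = [], []
    for _ in range(m):
        bit = rng.choice((-1, 1))
        # Keep the bit with probability (1 + rho)/2 and flip it otherwise.
        partner = bit if rng.random() < (1 + rho) / 2 else -bit
        x.append(bit)
        x_prime.append(partner)
    return x, x_prime

def bits_to_gaussian(bits):
    """Assumed form of BitsToGaussians_M for n = 1: the normalized sum of M bits."""
    return sum(bits) / math.sqrt(len(bits))

def empirical_correlation(m, rho, trials=20_000, seed=1):
    """Estimate E[y y'] where y, y' are built from rho-correlated bit strings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x, x_prime = correlated_bits(m, rho, rng)
        total += bits_to_gaussian(x) * bits_to_gaussian(x_prime)
    return total / trials

correlation = empirical_correlation(m=30, rho=0.6)
print(f"empirical E[y y'] = {correlation:.3f} (target rho = 0.6)")
```

Under this assumption $\mathbf{E}[y y'] = \rho$ holds exactly for every $M$; it is only the joint distribution of $(y, y')$ that needs $M \to \infty$ to approach that of $\rho$-correlated Gaussians.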


An immediate corollary, using the proof of Proposition 10.4, is the standard 
one-function form of hypercontractivity: 


Theorem 11.23. Let $1 < p \le q < \infty$ and let $f \in L^p(\mathbb{R}^n, \gamma)$. Then
$\|U_\rho f\|_q \le \|f\|_p$ for $0 \le \rho \le \sqrt{\frac{p-1}{q-1}}$.


We conclude this section by discussing the Gaussian space analogue of the 
discrete Laplacian operator. Taking our cue from Exercise 2.18 we make the 
following definition: 


Definition 11.24. The Ornstein-Uhlenbeck operator $L$ (also called the
infinitesimal generator of the Ornstein-Uhlenbeck semigroup, or the number
operator) is the linear operator acting on functions $f \in L^2(\mathbb{R}^n, \gamma)$ by

$$L f = \frac{d}{d\rho} U_\rho f \Big|_{\rho = 1} = -\frac{d}{dt} U_{e^{-t}} f \Big|_{t = 0}$$

(provided $Lf$ exists in $L^2(\mathbb{R}^n, \gamma)$). Notational warning: It is common to see
this as the definition of $-L$.


Remark 11.25. We will not be completely careful about the domain of the
operator $L$ in this section; for precise details, see Exercise 11.18.


Proposition 11.26. Let $f \in L^2(\mathbb{R}^n, \gamma)$ be in the domain of $L$, and further
assume for simplicity that $f$ is $\mathcal{C}^3$. Then we have the formula

$$L f(x) = x \cdot \nabla f(x) - \Delta f(x),$$

where $\Delta$ denotes the usual Laplacian differential operator, $\cdot$ denotes the dot
product, and $\nabla$ denotes the gradient.


Proof. We give the proof in the case $n = 1$, leaving the general case to
Exercise 11.7. We have

$$L f(x) = -\lim_{t \to 0^+} \frac{\mathbf{E}_{z \sim N(0,1)}[f(e^{-t} x + \sqrt{1 - e^{-2t}}\, z)] - f(x)}{t}. \tag{11.3}$$

Applying Taylor’s theorem to $f$ we have

$$f(e^{-t} x + \sqrt{1 - e^{-2t}}\, z) \approx f(e^{-t} x) + f'(e^{-t} x) \sqrt{1 - e^{-2t}}\, z + \tfrac{1}{2} f''(e^{-t} x)(1 - e^{-2t}) z^2,$$

where the $\approx$ denotes that the two quantities differ by at most $C (1 - e^{-2t})^{3/2} |z|^3$
in absolute value, for some constant $C$ depending on $f$ and $x$. Substituting this
into (11.3) and using $\mathbf{E}[z] = 0$, $\mathbf{E}[z^2] = 1$, and that $\mathbf{E}[|z|^3]$ is an absolute
constant, we get

$$L f(x) = -\lim_{t \to 0^+} \left( \frac{f(e^{-t} x) - f(x)}{t} + \frac{\tfrac{1}{2} f''(e^{-t} x)(1 - e^{-2t})}{t} \right),$$

using the fact that $\frac{(1 - e^{-2t})^{3/2}}{t} \to 0$. But this is easily seen to be
$x f'(x) - f''(x)$, as claimed.
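For polynomials in one variable the formula $Lf(x) = x f'(x) - f''(x)$ can be applied mechanically to a coefficient list. The sketch below is only an illustration; the function name and the coefficient-list representation (lowest degree first) are assumptions made here.

```python
def ornstein_uhlenbeck(coeffs):
    """Apply L f = x f'(x) - f''(x) to f(x) = sum_k coeffs[k] * x**k (the case n = 1)."""
    result = [0] * len(coeffs)
    for k, c in enumerate(coeffs):
        result[k] += k * c                     # x * d/dx (c x^k) = k c x^k
        if k >= 2:
            result[k - 2] -= k * (k - 1) * c   # -d^2/dx^2 (c x^k) = -k (k-1) c x^(k-2)
    return result

cubic = [0, -3, 0, 1]              # x^3 - 3x
print(ornstein_uhlenbeck(cubic))   # [0, -9, 0, 3], i.e. 3 * (x^3 - 3x)
```

In this example $L$ simply multiplies the cubic $x^3 - 3x$ by its degree, 3.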


An easy consequence of the semigroup property is the following: 


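Proposition 11.27. The following equivalent identities hold:

$$\frac{d}{d\rho} U_\rho f = \rho^{-1} L U_\rho f = \rho^{-1} U_\rho L f,$$

$$\frac{d}{dt} U_{e^{-t}} f = -L U_{e^{-t}} f = -U_{e^{-t}} L f.$$


Proof. This follows from

$$\frac{d}{dt} U_{e^{-t}} f(x) = \lim_{\delta \to 0} \frac{U_{e^{-t-\delta}} f(x) - U_{e^{-t}} f(x)}{\delta} = \lim_{\delta \to 0} \frac{U_{e^{-\delta}} U_{e^{-t}} f(x) - U_{e^{-t}} f(x)}{\delta} = \lim_{\delta \to 0} \frac{U_{e^{-t}} U_{e^{-\delta}} f(x) - U_{e^{-t}} f(x)}{\delta}.$$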


We also have the following formula: 


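Proposition 11.28. Let $f, g \in L^2(\mathbb{R}^n, \gamma)$ be in the domain of $L$, and further
assume for simplicity that they are $\mathcal{C}^3$. Then

$$\langle f, L g \rangle = \langle L f, g \rangle = \langle \nabla f, \nabla g \rangle. \tag{11.4}$$


Proof. It suffices to prove the equality on the right of (11.4). We again
treat only the case of $n = 1$, leaving the general case to Exercise 11.8. Using
Proposition 11.26,

$$\begin{aligned}
\langle L f, g \rangle &= \int_{\mathbb{R}} (x f'(x) - f''(x)) g(x) \varphi(x)\, dx \\
&= \int_{\mathbb{R}} x f'(x) g(x) \varphi(x)\, dx + \int_{\mathbb{R}} f'(x) (g \varphi)'(x)\, dx \quad \text{(integration by parts)} \\
&= \int_{\mathbb{R}} x f'(x) g(x) \varphi(x)\, dx + \int_{\mathbb{R}} f'(x) (g'(x) \varphi(x) + g(x) \varphi'(x))\, dx \\
&= \int_{\mathbb{R}} f'(x) g'(x) \varphi(x)\, dx,
\end{aligned}$$

using the fact that $\varphi'(x) = -x \varphi(x)$.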


Finally, by differentiating the Gaussian Hypercontractivity Inequality we 
obtain the Gaussian Log-Sobolev Inequality (see Exercise 10.23; the proof is 
the same as in the Boolean case): 


Gaussian Log-Sobolev Inequality. Let $f \in L^2(\mathbb{R}^n, \gamma)$ be in the domain of $L$.
Then

$$\tfrac{1}{2} \mathbf{Ent}[f^2] \le \mathbf{E}[\|\nabla f\|^2].$$


It’s tempting to use the notation $\mathbf{I}[f]$ for $\mathbf{E}[\|\nabla f\|^2]$; however, you have to
be careful because this quantity is not equal to $\sum_{i=1}^n \mathbf{E}[\mathrm{Var}_{z_i}[f]]$ unless $f$ is a
multilinear polynomial. See Exercise 11.13.


11.2. Hermite Polynomials 


Having defined the basic operators of importance for functions on Gaussian
space, it’s useful to also develop the analogue of the Fourier expansion. To
do this we’ll proceed as in Chapter 8.1, looking for a complete orthonormal
“Fourier basis” for $L^2(\mathbb{R}, \gamma)$, which we can extend to $L^2(\mathbb{R}^n, \gamma)$ by taking
products. It’s natural to start with polynomials; by Theorem 11.22 we know
that the collection $(\phi_j)_{j \in \mathbb{N}}$, $\phi_j(z) = z^j$ is a complete basis for $L^2(\mathbb{R}, \gamma)$. To get
an orthonormal (“Fourier”) basis we can simply perform the Gram-Schmidt
process. Calling the resulting basis $(h_j)_{j \in \mathbb{N}}$ (with “h” standing for “Hermite”),
we get

$$h_0(z) = 1, \quad h_1(z) = z, \quad h_2(z) = \frac{z^2 - 1}{\sqrt{2}}, \quad h_3(z) = \frac{z^3 - 3z}{\sqrt{6}}, \quad \dots \tag{11.5}$$

Here, e.g., we obtained $h_3(z)$ in two steps. First, we made $\phi_3(z) = z^3$ orthogonal
to $h_0, \dots, h_2$ as

$$z^3 - \langle z^3, 1 \rangle \cdot 1 - \langle z^3, z \rangle \cdot z - \left\langle z^3, \tfrac{z^2 - 1}{\sqrt{2}} \right\rangle \cdot \tfrac{z^2 - 1}{\sqrt{2}} = z^3 - 3z,$$

where $z \sim N(0, 1)$ and we used the fact that $z^3$ and $z^3 \cdot \frac{z^2 - 1}{\sqrt{2}}$ are odd functions
and hence have Gaussian expectation 0. Then we defined $h_3(z) = \frac{z^3 - 3z}{\sqrt{6}}$ after
determining that $\mathbf{E}[(z^3 - 3z)^2] = 6$.
Let’s develop a more explicit definition of these Hermite polynomials. The
computations involved in the Gram-Schmidt process require knowledge of
the moments of a Gaussian random variable $z \sim N(0, 1)$. It’s most convenient
to understand these moments through the moment generating function of $z$,
namely

$$\mathbf{E}[\exp(tz)] = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{tz} e^{-z^2/2}\, dz = e^{t^2/2} \cdot \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-(z - t)^2/2}\, dz = \exp(\tfrac{1}{2} t^2). \tag{11.6}$$
In light of our interest in the $U_\rho$ operators, and the fact that orthonormality
involves pairs of basis functions, we’ll in fact study the moment generating
function of a pair $(z, z')$ of $\rho$-correlated standard Gaussians. To compute it,
assume $(z, z')$ are generated as in Fact 11.7 with $u, v$ unit vectors in $\mathbb{R}^2$. Then

$$\begin{aligned}
\mathbf{E}_{(z, z')\ \rho\text{-correlated}}[\exp(sz + tz')]
&= \mathbf{E}_{g_1, g_2 \sim N(0,1)\ \text{independent}}[\exp(s(u_1 g_1 + u_2 g_2) + t(v_1 g_1 + v_2 g_2))] \\
&= \mathbf{E}_{g_1 \sim N(0,1)}[\exp((s u_1 + t v_1) g_1)] \, \mathbf{E}_{g_2 \sim N(0,1)}[\exp((s u_2 + t v_2) g_2)] \\
&= \exp(\tfrac{1}{2}(s u_1 + t v_1)^2) \exp(\tfrac{1}{2}(s u_2 + t v_2)^2) \\
&= \exp(\tfrac{1}{2} \|u\|_2^2 s^2 + \langle u, v \rangle st + \tfrac{1}{2} \|v\|_2^2 t^2) \\
&= \exp(\tfrac{1}{2}(s^2 + 2\rho st + t^2)),
\end{aligned}$$

where the third equality used (11.6). Dividing by $\exp(\tfrac{1}{2}(s^2 + t^2))$ it follows
that

$$\mathbf{E}_{(z, z')\ \rho\text{-correlated}}[\exp(sz - \tfrac{1}{2} s^2) \exp(tz' - \tfrac{1}{2} t^2)] = \exp(\rho st) = \sum_{j=0}^{\infty} \frac{\rho^j}{j!} s^j t^j. \tag{11.7}$$
2,2 Era A 
p-correlated j=0 


Inside the expectation above we essentially have the expression $\exp(tz - \tfrac{1}{2} t^2)$
appearing twice. It’s easy to see that if we take the power series in $t$ for this
expression, the coefficient on $t^j$ will be a polynomial in $z$ with leading term $\frac{1}{j!} z^j$.
Let’s therefore write

$$\exp(tz - \tfrac{1}{2} t^2) = \sum_{j=0}^{\infty} \frac{1}{j!} H_j(z) t^j, \tag{11.8}$$

where $H_j(z)$ is a monic polynomial of degree $j$. Now substituting this into (11.7)
yields

$$\sum_{j,k=0}^{\infty} \frac{1}{j!\,k!} \mathbf{E}_{(z, z')\ \rho\text{-correlated}}[H_j(z) H_k(z')] \, s^j t^k = \sum_{j=0}^{\infty} \frac{\rho^j}{j!} s^j t^j.$$

Equating coefficients, it follows that we must have

$$\mathbf{E}_{(z, z')\ \rho\text{-correlated}}[H_j(z) H_k(z')] = \begin{cases} \rho^j j! & \text{if } j = k, \\ 0 & \text{if } j \ne k. \end{cases}$$

In particular (taking $\rho = 1$),

$$\langle H_j, H_k \rangle = \begin{cases} j! & \text{if } j = k, \\ 0 & \text{if } j \ne k; \end{cases} \tag{11.9}$$

i.e., the polynomials $(H_j)_{j \in \mathbb{N}}$ are orthogonal. Furthermore, since $H_j$ is monic
and of degree $j$, it follows that the $H_j$’s are precisely the polynomials that
arise in the Gram-Schmidt orthogonalization of $\{1, z, z^2, \dots\}$. We also see
from (11.9) that the orthonormalized polynomials $(h_j)_{j \in \mathbb{N}}$ are obtained by
setting $h_j = \frac{1}{\sqrt{j!}} H_j$.

Let’s summarize and introduce the terminology for what we’ve deduced. 


Definition 11.29. The probabilists’ Hermite polynomials $(H_j)_{j \in \mathbb{N}}$ are the
univariate polynomials defined by the identity (11.8). An equivalent definition
(Exercise 11.9) is

$$H_j(z) = \frac{(-1)^j}{\varphi(z)} \cdot \frac{d^j}{dz^j} \varphi(z). \tag{11.10}$$

The normalized Hermite polynomials $(h_j)_{j \in \mathbb{N}}$ are defined by $h_j = \frac{1}{\sqrt{j!}} H_j$; the
first four are given explicitly in (11.5). For brevity we’ll simply refer to the $h_j$’s
as the “Hermite polynomials”, though this is not standard terminology.
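The generating-function identity (11.8) can be turned directly into a computation: multiplying the power series of $e^{tz}$ and $e^{-t^2/2}$ and collecting the coefficient of $t^j$ gives $H_j(z) = j! \sum_{b} \frac{(-1/2)^b}{(j-2b)!\, b!} z^{j-2b}$. The sketch below (with illustrative function names, and using the standard even moments $\mathbf{E}[z^{2m}] = (2m-1)!!$) computes the $H_j$ exactly and checks the orthogonality relation (11.9).

```python
from fractions import Fraction
from math import factorial

def hermite(j):
    """Coefficients (lowest degree first) of H_j, read off from exp(tz - t^2/2)."""
    coeffs = [Fraction(0)] * (j + 1)
    for b in range(j // 2 + 1):
        a = j - 2 * b  # power of z that accompanies t^a * (t^2)^b = t^j
        coeffs[a] = Fraction(factorial(j) * (-1) ** b,
                             factorial(a) * factorial(b) * 2 ** b)
    return coeffs

def gaussian_moment(k):
    """E[z^k] for z ~ N(0,1): 0 for odd k and (k-1)!! for even k."""
    if k % 2:
        return 0
    result = 1
    for odd in range(1, k, 2):
        result *= odd
    return result

def inner_product(p, q):
    """<p, q> = E[p(z) q(z)] for polynomials given as coefficient lists."""
    return sum(pa * qb * gaussian_moment(a + b)
               for a, pa in enumerate(p) for b, qb in enumerate(q))

print(hermite(3))                             # coefficients of z^3 - 3z
print(inner_product(hermite(3), hermite(3)))  # 6 = 3!
```

Dividing $H_j$ by $\sqrt{j!}$ then gives the normalized $h_j$ of (11.5).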


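Proposition 11.30. The Hermite polynomials $(h_j)_{j \in \mathbb{N}}$ form a complete
orthonormal basis for $L^2(\mathbb{R}, \gamma)$. They are also a “Fourier basis”, since $h_0 = 1$.


Proposition 11.31. For any $\rho \in [-1, 1]$ we have

$$\mathbf{E}_{(z, z')\ \rho\text{-correlated}}[h_j(z) h_k(z')] = \langle h_j, U_\rho h_k \rangle = \langle U_\rho h_j, h_k \rangle = \begin{cases} \rho^j & \text{if } j = k, \\ 0 & \text{if } j \ne k. \end{cases}$$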


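From this “Fourier basis” for $L^2(\mathbb{R}, \gamma)$ we can construct a “Fourier basis”
for $L^2(\mathbb{R}^n, \gamma)$ just by taking products, as in Proposition 8.13.


Definition 11.32. For a multi-index $\alpha \in \mathbb{N}^n$ we define the (normalized
multivariate) Hermite polynomial $h_\alpha : \mathbb{R}^n \to \mathbb{R}$ by

$$h_\alpha(z) = \prod_{j=1}^{n} h_{\alpha_j}(z_j).$$

Note that the total degree of $h_\alpha$ is $|\alpha| = \sum_j \alpha_j$. We also identify a subset $S \subseteq [n]$
with its indicator $\alpha$ defined by $\alpha_j = 1_{j \in S}$; thus $h_S(z)$ denotes $z^S = \prod_{j \in S} z_j$.


Proposition 11.33. The Hermite polynomials $(h_\alpha)_{\alpha \in \mathbb{N}^n}$ form a complete
orthonormal (Fourier) basis for $L^2(\mathbb{R}^n, \gamma)$. Further, for any $\rho \in [-1, 1]$ we
have

$$\mathbf{E}_{(z, z')\ \rho\text{-correlated}}[h_\alpha(z) h_\beta(z')] = \langle h_\alpha, U_\rho h_\beta \rangle = \langle U_\rho h_\alpha, h_\beta \rangle = \begin{cases} \rho^{|\alpha|} & \text{if } \alpha = \beta, \\ 0 & \text{if } \alpha \ne \beta. \end{cases}$$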




We can now define the “Hermite expansion” of Gaussian functions. 


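Definition 11.34. Every $f \in L^2(\mathbb{R}^n, \gamma)$ is uniquely expressible as

$$f = \sum_{\alpha \in \mathbb{N}^n} \hat{f}(\alpha) h_\alpha,$$

where the real numbers $\hat{f}(\alpha)$ are called the Hermite coefficients of $f$ and the
convergence is in $L^2(\mathbb{R}^n, \gamma)$; i.e.,

$$\Big\| f - \sum_{|\alpha| \le k} \hat{f}(\alpha) h_\alpha \Big\|_2 \to 0 \quad \text{as } k \to \infty.$$

This is called the Hermite expansion of $f$.


Remark 11.35. If $f : \mathbb{R}^n \to \mathbb{R}$ is a multilinear polynomial, then it “is its own
Hermite expansion”:

$$f(z) = \sum_{S \subseteq [n]} \hat{f}(S) z^S = \sum_{S \subseteq [n]} \hat{f}(S) h_S(z) = \sum_{\alpha_1, \dots, \alpha_n \le 1} \hat{f}(\alpha) h_\alpha(z).$$


Proposition 11.36. The Hermite coefficients of $f \in L^2(\mathbb{R}^n, \gamma)$ satisfy the formula

$$\hat{f}(\alpha) = \langle f, h_\alpha \rangle,$$

and for $f, g \in L^2(\mathbb{R}^n, \gamma)$ we have the Plancherel formula

$$\langle f, g \rangle = \sum_{\alpha \in \mathbb{N}^n} \hat{f}(\alpha) \hat{g}(\alpha).$$
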
From this we may deduce: 


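Proposition 11.37. For $f \in L^2(\mathbb{R}^n, \gamma)$, the function $U_\rho f$ has Hermite expansion

$$U_\rho f = \sum_{\alpha \in \mathbb{N}^n} \rho^{|\alpha|} \hat{f}(\alpha) h_\alpha$$

and hence

$$\mathrm{Stab}_\rho[f] = \sum_{\alpha \in \mathbb{N}^n} \rho^{|\alpha|} \hat{f}(\alpha)^2.$$


Proof. Both statements follow from Proposition 11.36, with the first using

$$\widehat{U_\rho f}(\alpha) = \langle U_\rho f, h_\alpha \rangle = \Big\langle U_\rho \sum_\beta \hat{f}(\beta) h_\beta, h_\alpha \Big\rangle = \sum_\beta \hat{f}(\beta) \langle U_\rho h_\beta, h_\alpha \rangle = \rho^{|\alpha|} \hat{f}(\alpha);$$

we also used Proposition 11.33 and the fact that $U_\rho$ is a contraction in $L^2(\mathbb{R}^n, \gamma)$.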
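As a small worked instance of Proposition 11.37 in the case $n = 1$, take $f(z) = z^2 = h_0(z) + \sqrt{2}\, h_2(z)$, so that the formula predicts $\mathrm{Stab}_\rho[f] = 1 + 2\rho^2$. The sketch below compares this with a Monte Carlo estimate of $\mathbf{E}[f(z) f(z')]$; the function names, $\rho = 0.5$, the sample size, and the seed are illustrative assumptions.

```python
import math
import random

def stability_monte_carlo(f, rho, trials=200_000, seed=2):
    """Estimate Stab_rho[f] = E[f(z) f(z')] for rho-correlated standard Gaussians (n = 1)."""
    rng = random.Random(seed)
    scale = math.sqrt(1 - rho * rho)
    total = 0.0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)
        z_prime = rho * z + scale * rng.gauss(0.0, 1.0)  # rho-correlated with z
        total += f(z) * f(z_prime)
    return total / trials

def stability_from_coefficients(coefficients, rho):
    """Stab_rho[f] = sum_j rho^j * fhat(j)^2, where coefficients[j] = fhat(j)."""
    return sum(rho ** j * c * c for j, c in enumerate(coefficients))

square_coefficients = [1.0, 0.0, math.sqrt(2)]  # z^2 = h_0 + sqrt(2) h_2
exact = stability_from_coefficients(square_coefficients, 0.5)
sampled = stability_monte_carlo(lambda z: z * z, 0.5)
print(f"formula: {exact:.4f}, Monte Carlo: {sampled:.4f}")
```

The two numbers should agree up to sampling error.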


Remark 11.38. When $f : \mathbb{R}^n \to \mathbb{R}$ is a multilinear polynomial, this formula
for $U_\rho f$ agrees with the formula $f(\rho z)$ given in Fact 11.13.


Remark 11.39. In a sense it’s not very important to know the explicit formulas
for the Hermite polynomials, (11.5), (11.8); it’s usually enough just to know
that the formula for $U_\rho f$ from Proposition 11.37 holds.


Finally, by differentiating the formula in Proposition 11.37 at $\rho = 1$ we
deduce the following formula for the Ornstein-Uhlenbeck operator (explaining
why it’s sometimes called the number operator):


Proposition 11.40. For $f \in L^2(\mathbb{R}^n, \gamma)$ in the domain of $L$ we have

$$L f = \sum_{\alpha \in \mathbb{N}^n} |\alpha| \hat{f}(\alpha) h_\alpha.$$

(Actually, Exercise 11.18 asks you to formally justify this and the fact that
$f$ is in the domain of $L$ if and only if $\sum_\alpha |\alpha|^2 \hat{f}(\alpha)^2 < \infty$.) For additional facts
about Hermite polynomials, see Exercises 11.9-11.14.


11.3. Borell’s Isoperimetric Theorem 


If we believe that the Majority Is Stablest Theorem should be true, then we
also have to believe in its “Gaussian special case”. Let’s see what this Gaussian
special case is. Suppose $f : \mathbb{R}^n \to [-1, 1]$ is a “nice” function (smooth, say,
with all derivatives bounded) having $\mathbf{E}[f] = 0$. You’re encouraged to think
of $f$ as (a smooth approximation to) the indicator $\pm 1_A$ of some set $A \subseteq \mathbb{R}^n$
of Gaussian volume $\mathrm{vol}_\gamma(A) = \frac{1}{2}$. Now consider the Boolean function
$g : \{-1, 1\}^{nM} \to \{-1, 1\}$ defined by

$$g = f \circ \mathrm{BitsToGaussians}_M^n.$$

Using the multidimensional Central Limit Theorem, for any $\rho \in (0, 1)$ we
should have

$$\mathrm{Stab}_\rho[g] \xrightarrow{M \to \infty} \mathrm{Stab}_\rho[f],$$

where on the left we have Boolean noise stability and on the right we have
Gaussian noise stability. Using $\mathbf{E}[g] \to \mathbf{E}[f] = 0$, the Majority Is Stablest
Theorem would tell us that

$$\mathrm{Stab}_\rho[g] \le 1 - \frac{2}{\pi} \arccos \rho + o_\epsilon(1),$$

where $\epsilon = \mathrm{MaxInf}[g]$. But $\epsilon = \epsilon(M) \to 0$ as $M \to \infty$. Thus we should
simply have the Gaussian noise stability bound

$$\mathrm{Stab}_\rho[f] \le 1 - \frac{2}{\pi} \arccos \rho. \tag{11.11}$$

(By a standard approximation argument this extends from “nice” $f : \mathbb{R}^n \to [-1, 1]$
with $\mathbf{E}[f] = 0$ to any measurable $f : \mathbb{R}^n \to [-1, 1]$ with $\mathbf{E}[f] = 0$.)
Note that the upper bound (11.11) is achieved when $f$ is the $\pm 1$-indicator of
any halfspace through the origin; see Corollary 11.20. (Note also that if $n = 1$
and $f = \mathrm{sgn}$, then the function $g$ is simply $\mathrm{Maj}_M$.)

The “isoperimetric inequality” (11.11) is indeed true, and is a special case 
of a theorem first proved by Borell (Borell, 1985). 


Borell’s Isoperimetric Theorem (volume-$\frac{1}{2}$ case). Fix $\rho \in (0, 1)$. Then for
any $f \in L^2(\mathbb{R}^n, \gamma)$ with range $[-1, 1]$ and $\mathbf{E}[f] = 0$,

$$\mathrm{Stab}_\rho[f] \le 1 - \frac{2}{\pi} \arccos \rho,$$

with equality if $f$ is the $\pm 1$-indicator of any halfspace through the origin.


Remark 11.41. In Borell’s Isoperimetric Theorem, nothing is lost by restricting
attention to functions with range $\{-1, 1\}$, i.e., by considering only $f = \pm 1_A$
for $A \subseteq \mathbb{R}^n$. This is because the case of range $[-1, 1]$ follows straightforwardly
from the case of range $\{-1, 1\}$, essentially because $\sqrt{\mathrm{Stab}_\rho[f]} = \|U_{\sqrt{\rho}} f\|_2$ is
a convex functional of $f$; see Exercise 11.25.


More generally, Borell showed that for any fixed volume $\alpha \in [0, 1]$, the
maximum Gaussian noise stability of a set of volume $\alpha$ is no greater than that
of a halfspace of volume $\alpha$. We state here the more general theorem, using
range $\{0, 1\}$ rather than range $\{-1, 1\}$ for future notational convenience (and
with Remark 11.41 applying equally):


Borell’s Isoperimetric Theorem. Fix $\rho \in (0, 1)$. Then for any $f \in L^2(\mathbb{R}^n, \gamma)$
with range $[0, 1]$ and $\mathbf{E}[f] = \alpha$,

$$\mathrm{Stab}_\rho[f] \le \Lambda_\rho(\alpha).$$

Here $\Lambda_\rho(\alpha)$ is the Gaussian quadrant probability function, discussed in
Exercises 5.32 and 11.19, and equal to $\mathrm{Stab}_\rho[1_H]$ for any (every) halfspace $H \subseteq \mathbb{R}^n$
having Gaussian volume $\mathrm{vol}_\gamma(H) = \alpha$.

We’ve seen that the volume-$\frac{1}{2}$ case of Borell’s Isoperimetric Theorem is a
special case of the Majority Is Stablest Theorem, and similarly, the general
version of Borell’s theorem is a special case of the General-Volume Majority
Is Stablest Theorem mentioned at the beginning of the chapter. As a
consequence, proving Borell’s Isoperimetric Theorem is a prerequisite for proving
the General-Volume Majority Is Stablest Theorem. In fact, our proof in
Section 11.7 of the latter will be a reduction to the former.

The proof of Borell’s Isoperimetric Theorem itself is not too hard; one
of five known proofs, the one due to Mossel and Neeman (Mossel and Neeman,
2012), is outlined in Exercises 11.26-11.29. If our main goal is just
to prove the basic Majority Is Stablest Theorem, then we only need the
volume-$\frac{1}{2}$ case of Borell’s Isoperimetric Inequality. Luckily, there’s a very
simple proof of this volume-$\frac{1}{2}$ case for “many” values of $\rho$, as we will now
explain.

Let’s first slightly rephrase the statement of Borell’s Isoperimetric Theorem
in the volume-$\frac{1}{2}$ case. By Remark 11.41 we can restrict attention to sets; then the
theorem asserts that among sets of Gaussian volume $\frac{1}{2}$, halfspaces through the
origin have maximal noise stability, for each positive value of $\rho$. Equivalently,
halfspaces through the origin have minimal noise sensitivity under correlation
$\cos \theta$, for $\theta \in (0, \frac{\pi}{2})$. The formula for this minimal noise sensitivity was given
as (11.2) in our proof of Sheppard’s Formula. Thus we have:


Equivalent statement of the volume-$\frac{1}{2}$ Borell Isoperimetric Theorem. Fix
$\theta \in (0, \frac{\pi}{2})$. Then for any $A \subseteq \mathbb{R}^n$ with $\mathrm{vol}_\gamma(A) = \frac{1}{2}$,

$$\Pr_{(z, z')\ \cos\theta\text{-correlated}}[1_A(z) \ne 1_A(z')] \ge \frac{\theta}{\pi},$$

with equality if $A$ is any halfspace through the origin.


In the remainder of this section we’ll show how to prove this formulation
of the theorem whenever $\theta = \frac{\pi}{2\ell}$, where $\ell$ is a positive integer. This gives
the volume-$\frac{1}{2}$ case of Borell’s Isoperimetric Inequality for all $\rho$ of the form
$\cos \frac{\pi}{2\ell}$, $\ell \in \mathbb{N}^+$; in particular, for an infinite sequence of $\rho$’s tending to 1. To
prove the theorem for these values of $\theta$, it’s convenient to introduce notation
for the following noise sensitivity variant:


Definition 11.42. For $A \subseteq \mathbb{R}^n$ and $\delta \in \mathbb{R}$ (usually $\delta \in [0, \pi]$) we write $\mathrm{RS}_A(\delta)$
for the rotation sensitivity of $A$ at $\delta$, defined by

$$\mathrm{RS}_A(\delta) = \Pr_{(z, z')\ \cos\delta\text{-correlated}}[1_A(z) \ne 1_A(z')].$$


The key property of this definition is the following: 
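Theorem 11.43. For any $A \subseteq \mathbb{R}^n$ the function $\mathrm{RS}_A(\delta)$ is subadditive; i.e.,

$$\mathrm{RS}_A(\delta_1 + \cdots + \delta_\ell) \le \mathrm{RS}_A(\delta_1) + \cdots + \mathrm{RS}_A(\delta_\ell).$$

In particular, for any $\delta \in \mathbb{R}$ and $\ell \in \mathbb{N}^+$,

$$\mathrm{RS}_A(\delta) \le \ell \cdot \mathrm{RS}_A(\delta / \ell).$$


Proof. Let $g, g' \sim N(0, 1)^n$ be drawn independently and define
$z(\theta) = (\cos \theta) g + (\sin \theta) g'$. Geometrically, as $\theta$ goes from 0 to $\frac{\pi}{2}$ the random vectors
$z(\theta)$ trace from $g$ to $g'$ along the origin-centered ellipse passing through these
two points. The random vectors $z(\theta)$ are jointly normal, with each individually
distributed as $N(0, 1)^n$. Further, for each fixed $\theta, \theta' \in \mathbb{R}$ the pair $(z(\theta), z(\theta'))$
constitute $\rho$-correlated Gaussians with

$$\rho = \cos \theta \cos \theta' + \sin \theta \sin \theta' = \cos(\theta' - \theta).$$

Now consider the sequence $\theta_0, \dots, \theta_\ell$ defined by the partial sums of the $\delta_i$’s,
i.e., $\theta_j = \sum_{i=1}^{j} \delta_i$. We get that $z(\theta_0)$ and $z(\theta_\ell)$ are $\cos(\delta_1 + \cdots + \delta_\ell)$-correlated,
and that $z(\theta_{j-1})$ and $z(\theta_j)$ are $\cos \delta_j$-correlated for each $j \in [\ell]$. Thus

$$\mathrm{RS}_A(\delta_1 + \cdots + \delta_\ell) = \Pr[1_A(z(\theta_0)) \ne 1_A(z(\theta_\ell))] \le \sum_{j=1}^{\ell} \Pr[1_A(z(\theta_j)) \ne 1_A(z(\theta_{j-1}))] = \sum_{j=1}^{\ell} \mathrm{RS}_A(\delta_j), \tag{11.12}$$

where the inequality is the union bound.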
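The construction $z(\theta) = (\cos\theta) g + (\sin\theta) g'$ from the proof also gives a direct way to estimate rotation sensitivity by sampling. The sketch below does this for $n = 1$ with two sets of Gaussian volume $\frac{1}{2}$: the halfline $(-\infty, 0]$ and a symmetric slab $[-c, c]$. The slab, the function names, the sample size, and the seed are illustrative assumptions rather than anything taken from the text.

```python
import math
import random
from statistics import NormalDist

def rotation_sensitivity(indicator, delta, n=1, trials=100_000, seed=3):
    """Estimate RS_A(delta) using z(0) = g and z(delta) = cos(delta) g + sin(delta) g'."""
    rng = random.Random(seed)
    c, s = math.cos(delta), math.sin(delta)
    flips = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        g_prime = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rotated = [c * a + s * b for a, b in zip(g, g_prime)]
        if indicator(g) != indicator(rotated):
            flips += 1
    return flips / trials

def halfline(z):
    return z[0] <= 0

SLAB_HALF_WIDTH = NormalDist().inv_cdf(0.75)  # makes [-c, c] have Gaussian volume 1/2

def slab(z):
    return abs(z[0]) <= SLAB_HALF_WIDTH

rs_halfline = rotation_sensitivity(halfline, math.pi / 4)
rs_slab_full = rotation_sensitivity(slab, math.pi / 2)
rs_slab_half = rotation_sensitivity(slab, math.pi / 4)
print(f"halfline RS(pi/4) = {rs_halfline:.3f} (Sheppard predicts 0.25)")
print(f"slab RS(pi/2) = {rs_slab_full:.3f} <= 2 * RS(pi/4) = {2 * rs_slab_half:.3f}")
```

For the halfline the estimate should match $\delta/\pi$ from (11.2); for the slab it illustrates the inequality $\mathrm{RS}_A(\delta) \le 2 \cdot \mathrm{RS}_A(\delta/2)$ with room to spare.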


With this subadditivity result in hand, it’s indeed easy to prove the
equivalent statement of the volume-$\frac{1}{2}$ Borell Isoperimetric Theorem for any
$\theta \in \{\frac{\pi}{4}, \frac{\pi}{6}, \frac{\pi}{8}, \frac{\pi}{10}, \dots\}$. As we’ll see in Section 11.7, the case of $\theta = \frac{\pi}{4}$ can be used
to give an excellent UG-hardness result for the Max-Cut CSP.


Corollary 11.44. The equivalent statement of the volume-$\frac{1}{2}$ Borell Isoperimetric
Theorem holds whenever $\theta = \frac{\pi}{2\ell}$ for $\ell \in \mathbb{N}^+$.


Proof. The exact statement we need to show is $\mathrm{RS}_A(\frac{\pi}{2\ell}) \ge \frac{1}{2\ell}$. This follows by
taking $\delta = \frac{\pi}{2}$ in Theorem 11.43 because

$$\mathrm{RS}_A(\tfrac{\pi}{2}) = \Pr_{(z, z')\ 0\text{-correlated}}[1_A(z) \ne 1_A(z')] = \frac{1}{2},$$

using that 0-correlated Gaussians are independent and that $\mathrm{vol}_\gamma(A) = \frac{1}{2}$.


Remark 11.45. Although Sheppard’s Formula already tells us that equality
holds in this corollary when $A$ is a halfspace through the origin, it’s also
not hard to derive this directly from the proof. The only inequality in the
proof, (11.12), is an equality when $A$ is a halfspace through the origin, because
the elliptical arc can only cross such a halfspace 0 or 1 times.


Remark 11.46. Suppose that $A \subseteq \mathbb{R}^n$ not only has volume $\frac{1}{2}$, it has the property
that $x \in A$ if and only if $-x \notin A$; in other words, the $\pm 1$-indicator of $A$ is an
odd function. (In both statements, we allow a set of measure 0 to be ignored.)
An example set with this property is any halfspace through the origin. Then
$\mathrm{RS}_A(\pi) = 1$, and hence we can establish Corollary 11.44 more generally for
any $\theta \in \{\frac{\pi}{1}, \frac{\pi}{2}, \frac{\pi}{3}, \frac{\pi}{4}, \frac{\pi}{5}, \dots\}$ by taking $\delta = \pi$ in the proof.


11.4. Gaussian Surface Area and Bobkov’s Inequality 


This section is devoted to studying the Gaussian Isoperimetric Inequality. This
inequality is a special case of the Borell Isoperimetric Inequality (and hence
also a special case of the General-Volume Majority Is Stablest Theorem); in
particular, it’s the special case arising from the limit $\rho \to 1^-$.

Restating Borell’s theorem using rotation sensitivity we have that for any
$A \subseteq \mathbb{R}^n$, if $H \subseteq \mathbb{R}^n$ is a halfspace with the same Gaussian volume as $A$ then
for all $\epsilon$,

$$\mathrm{RS}_A(\epsilon) \ge \mathrm{RS}_H(\epsilon).$$

Since $\mathrm{RS}_A(0) = \mathrm{RS}_H(0) = 0$, it follows that

$$\mathrm{RS}'_A(0^+) \ge \mathrm{RS}'_H(0^+).$$

(Here we are considering the one-sided derivatives at 0, which can be shown
to exist, though $\mathrm{RS}'_A(0^+)$ may equal $+\infty$; see the notes at the end of this
chapter.) As will be explained shortly, $\mathrm{RS}'_A(0^+)$ is precisely $\sqrt{2/\pi} \cdot \mathrm{surf}_\gamma(A)$,
where $\mathrm{surf}_\gamma(A)$ denotes the “Gaussian surface area” of $A$. Therefore the above
inequality is equivalent to the following:


Gaussian Isoperimetric Inequality. Let $A \subseteq \mathbb{R}^n$ have $\mathrm{vol}_\gamma(A) = \alpha$ and let
$H \subseteq \mathbb{R}^n$ be any halfspace with $\mathrm{vol}_\gamma(H) = \alpha$. Then $\mathrm{surf}_\gamma(A) \ge \mathrm{surf}_\gamma(H)$.


Remark 11.47. As shown in Proposition 11.49 below, the right-hand side
in this inequality is equal to $\mathcal{U}(\alpha)$, where $\mathcal{U}$ is the Gaussian isoperimetric
function, encountered earlier in Definition 5.26 and defined by $\mathcal{U} = \varphi \circ \Phi^{-1}$.

Let’s now discuss the somewhat technical question of how to properly
define $\mathrm{surf}_\gamma(A)$, the Gaussian surface area of a set $A$. Perhaps the most natural
definition would be to equate it with the Gaussian Minkowski content of the
boundary $\partial A$ of $A$,

$$\gamma^+(\partial A) = \liminf_{\epsilon \to 0^+} \frac{\mathrm{vol}_\gamma(\{z : \mathrm{dist}(z, \partial A) < \epsilon/2\})}{\epsilon}. \tag{11.13}$$

(Relatedly, one might also consider the surface integral over $\partial A$ of the Gaussian
pdf $\varphi$.) Under the “official” definition of $\mathrm{surf}_\gamma(A)$ we give below in
Definition 11.48, we’ll indeed have $\mathrm{surf}_\gamma(A) = \gamma^+(\partial A)$ whenever $A$ is sufficiently
nice (say, a disjoint union of closed, full-dimensional, convex sets). However,
the Minkowski content definition is not a good one in general because it’s
possible to have $\gamma^+(\partial A_1) \ne \gamma^+(\partial A_2)$ for some sets $A_1$ and $A_2$ that are equivalent
up to measure 0. (For more information, see Exercise 11.15 and the notes at
the end of this chapter.)

As mentioned above, one “correct” definition is $\mathrm{surf}_\gamma(A) = \sqrt{\pi/2} \cdot \mathrm{RS}'_A(0^+)$.
This definition has the advantage of being insensitive to measure-0
changes to $A$. To connect this unusual-looking definition with Minkowski
content, let’s heuristically interpret $\mathrm{RS}'_A(0^+)$. We start by thinking of it as
$\frac{\mathrm{RS}_A(\epsilon)}{\epsilon}$ for “infinitesimal $\epsilon$”. Now $\mathrm{RS}_A(\epsilon)$ can be thought of as the probability
that the line segment $\ell$ joining two $\cos \epsilon$-correlated Gaussians crosses $\partial A$.
Since $\sin \epsilon \approx \epsilon$, $\cos \epsilon \approx 1$ up to $O(\epsilon^2)$, we can think of these correlated
Gaussians as $g$ and $g + \epsilon g'$ for independent $g, g' \sim N(0, 1)^n$. When $g$
lands near $\partial A$, the length of $\ell$ in the direction perpendicular to $\partial A$ will,
in expectation, be $\epsilon \, \mathbf{E}[|N(0, 1)|] = \sqrt{2/\pi}\, \epsilon$. Thus $\mathrm{RS}_A(\epsilon)$ should essentially
be $\sqrt{2/\pi} \cdot \mathrm{vol}_\gamma(\{z : \mathrm{dist}(z, \partial A) < \epsilon/2\})$ and we have heuristically justified

$$\sqrt{\pi/2} \cdot \mathrm{RS}'_A(0^+) = \sqrt{\pi/2} \cdot \lim_{\epsilon \to 0^+} \frac{\mathrm{RS}_A(\epsilon)}{\epsilon} = \gamma^+(\partial A). \tag{11.14}$$


One more standard idea for the definition of $\mathrm{surf}_\gamma(A)$ is “$\mathbf{E}[\|\nabla 1_A\|]$”.
This doesn’t quite make sense since $1_A \in L^1(\mathbb{R}^n, \gamma)$ is not actually
differentiable. However, we might consider replacing it with the limit of $\mathbf{E}[\|\nabla f_m\|]$
for a sequence $(f_m)$ of smooth functions approximating $1_A$. To see why this
notion should agree with the Gaussian Minkowski content $\gamma^+(\partial A)$ for nice
enough $A$, let’s suppose we have a smooth approximator $f$ to $1_A$ that agrees
with $1_A$ on $\{z : \mathrm{dist}(z, \partial A) \ge \epsilon/2\}$ and is (essentially) a linear function on
$\{z : \mathrm{dist}(z, \partial A) < \epsilon/2\}$. Then $\|\nabla f\|$ will be 0 on the former set and
(essentially) constantly $1/\epsilon$ on the latter (since it must climb from 0 to 1 over a
distance of $\epsilon$). Thus we indeed have

$$\mathbf{E}[\|\nabla f\|] \approx \frac{\mathrm{vol}_\gamma(\{z : \mathrm{dist}(z, \partial A) < \epsilon/2\})}{\epsilon} \approx \gamma^+(\partial A),$$

as desired. We summarize the above technical discussion with the following
definition/theorem, which is discussed further in the notes at the end of this
chapter:


Definition 11.48. For any $A \subseteq \mathbb{R}^n$, we define its Gaussian surface area to be

$$\mathrm{surf}_\gamma(A) = \sqrt{\pi/2} \cdot \mathrm{RS}'_A(0^+) \in [0, \infty].$$

An equivalent definition is

$$\mathrm{surf}_\gamma(A) = \inf \left\{ \liminf_{m \to \infty} \mathbf{E}_{z \sim N(0,1)^n}[\|\nabla f_m(z)\|] \right\},$$

where the infimum is over all sequences $(f_m)_{m \in \mathbb{N}}$ of smooth $f_m : \mathbb{R}^n \to [0, 1]$
with first partial derivatives in $L^2(\mathbb{R}^n, \gamma)$ such that $\|f_m - 1_A\|_1 \to 0$.
Furthermore, this infimum is actually achieved by taking $f_m = U_{\rho_m} 1_A$ for any sequence
$\rho_m \to 1^-$. Finally, the equality $\mathrm{surf}_\gamma(A) = \gamma^+(\partial A)$ with Gaussian Minkowski
content holds if $A$ is a disjoint union of closed, full-dimensional, convex sets.


To get further acquainted with this definition, let’s describe the Gaussian 
surface area of some basic sets. We start with halfspaces, which as mentioned in 
Remark 11.47 have Gaussian surface area given by the Gaussian isoperimetric 
function. 


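Proposition 11.49. Let $H \subseteq \mathbb{R}^n$ be any halfspace (open or closed) with
$\mathrm{vol}_\gamma(H) = \alpha \in (0, 1)$. Then $\mathrm{surf}_\gamma(H) = \mathcal{U}(\alpha) = \varphi(\Phi^{-1}(\alpha))$. In particular, if
$\alpha = 1/2$ (i.e., $H$’s boundary contains the origin) then $\mathrm{surf}_\gamma(H) = \frac{1}{\sqrt{2\pi}}$.

Proof. Just as in the proof of Corollary 11.20, by rotational symmetry we may
assume $H$ is a 1-dimensional halfline, $H = (-\infty, t]$. Since $\mathrm{vol}_\gamma(H) = \alpha$, we
have $t = \Phi^{-1}(\alpha)$. Then $\mathrm{surf}_\gamma(H)$ is equal to

$$\gamma^+(\partial H) = \lim_{\epsilon \to 0^+} \frac{\mathrm{vol}_\gamma(\{z \in \mathbb{R} : \mathrm{dist}(z, \partial H) < \frac{\epsilon}{2}\})}{\epsilon} = \lim_{\epsilon \to 0^+} \frac{\int_{t - \epsilon/2}^{t + \epsilon/2} \varphi(s)\, ds}{\epsilon} = \varphi(t) = \mathcal{U}(\alpha).$$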
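The limit in this proof is easy to check numerically. The sketch below evaluates $\mathcal{U}(\alpha) = \varphi(\Phi^{-1}(\alpha))$ with Python’s `statistics.NormalDist` and compares it with the band ratio $\mathrm{vol}_\gamma(\{z : |z - t| < \epsilon/2\})/\epsilon$ for a small $\epsilon$; the function names and the particular values $\alpha = 0.3$, $\epsilon = 10^{-4}$ are illustrative choices.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def isoperimetric(alpha):
    """U(alpha) = phi(Phi^{-1}(alpha)), extended by U(0) = U(1) = 0."""
    if alpha <= 0.0 or alpha >= 1.0:
        return 0.0
    return STANDARD_NORMAL.pdf(STANDARD_NORMAL.inv_cdf(alpha))

def halfline_band_ratio(alpha, eps):
    """vol_gamma({z : |z - t| < eps/2}) / eps for H = (-inf, t] with t = Phi^{-1}(alpha)."""
    t = STANDARD_NORMAL.inv_cdf(alpha)
    band = STANDARD_NORMAL.cdf(t + eps / 2) - STANDARD_NORMAL.cdf(t - eps / 2)
    return band / eps

print(isoperimetric(0.5), 1 / math.sqrt(2 * math.pi))
print(halfline_band_ratio(0.3, 1e-4), isoperimetric(0.3))
```

Both printed pairs should agree to several decimal places.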


Here are some more Gaussian surface area bounds: 


Example 11.50. In Exercise 11.16 you are asked to generalize the above
computation and show that if $A \subseteq \mathbb{R}$ is the union of disjoint nondegenerate
intervals $[t_1, t_2], [t_3, t_4], \dots, [t_{2m-1}, t_{2m}]$ then $\mathrm{surf}_\gamma(A) = \sum_{i=1}^{2m} \varphi(t_i)$. Perhaps
the next easiest example is when $A \subseteq \mathbb{R}^n$ is an origin-centered ball; Ball (Ball,
1993) gave an explicit formula for $\mathrm{surf}_\gamma(A)$ in terms of the dimension and
radius, one which is always less than $\sqrt{2/\pi}$ (see Exercise 11.17). This upper bound
was extended to non-origin-centered balls in Klivans et al. (Klivans et al., 2008).
Ball also showed that every convex set $A \subseteq \mathbb{R}^n$ satisfies $\mathrm{surf}_\gamma(A) \le O(n^{1/4})$;
Nazarov (Nazarov, 2003) showed that this bound is tight up to the constant,
using a construction highly reminiscent of Talagrand’s Exercise 4.18. As noted
in Klivans et al. (Klivans et al., 2008), Nazarov’s work also immediately implies
that an intersection of $k$ halfspaces has Gaussian surface area at most $O(\sqrt{\log k})$
(tight for appropriately sized cubes in $\mathbb{R}^k$), and that any cone in $\mathbb{R}^n$ with apex
at the origin has Gaussian surface area at most 1. Finally, by proving the
“Gaussian special case” of the Gotsman-Linial Conjecture, Kane (Kane, 2011)
established that if $A \subseteq \mathbb{R}^n$ is a degree-$k$ “polynomial threshold function” (i.e.,
$A = \{z : p(z) > 0\}$ for $p$ an $n$-variate degree-$k$ polynomial) then
$\mathrm{surf}_\gamma(A) \le \frac{k}{\sqrt{2\pi}}$. This is tight for every $k$ (even when $n = 1$).


Though we’ve shown that the Gaussian Isoperimetric Inequality follows
from Borell’s Isoperimetric Theorem, we now discuss some alternative proofs.
In the special case of sets of Gaussian volume $\frac{1}{2}$, we can again get a very
simple proof using the subadditivity property of Gaussian rotation sensitivity,
Theorem 11.43. That result easily yields the following kind of “concavity
property” concerning Gaussian surface area:


Theorem 11.51. Let $A \subseteq \mathbb{R}^n$. Then for any $\delta > 0$,

$$\sqrt{\pi/2} \cdot \frac{\mathrm{RS}_A(\delta)}{\delta} \le \mathrm{surf}_\gamma(A).$$

Proof. For $\delta > 0$ and $\epsilon = \delta/\ell$, $\ell \in \mathbb{N}^+$, Theorem 11.43 is equivalent to

$$\frac{\mathrm{RS}_A(\delta)}{\delta} \le \frac{\mathrm{RS}_A(\epsilon)}{\epsilon}.$$

Taking $\ell \to \infty$ hence $\epsilon \to 0^+$, the right-hand side becomes
$\mathrm{RS}'_A(0^+) = \sqrt{2/\pi} \cdot \mathrm{surf}_\gamma(A)$.


If we take $\delta = \pi/2$ in this theorem, the left-hand side becomes

$$\sqrt{2/\pi} \cdot \Pr_{z, z'\ \text{independent}}[1_A(z) \ne 1_A(z')] = 2\sqrt{2/\pi} \cdot \mathrm{vol}_\gamma(A)(1 - \mathrm{vol}_\gamma(A)).$$

Thus we obtain a simple proof of the following result, which includes the
Gaussian Isoperimetric Inequality in the volume-$\frac{1}{2}$ case:


Theorem 11.52. Let $A \subseteq \mathbb{R}^n$. Then

$$2\sqrt{2/\pi} \cdot \mathrm{vol}_\gamma(A)(1 - \mathrm{vol}_\gamma(A)) \le \mathrm{surf}_\gamma(A).$$

In particular, if $\mathrm{vol}_\gamma(A) = \frac{1}{2}$, then we get the tight Gaussian Isoperimetric
Inequality statement $\mathrm{surf}_\gamma(A) \ge \frac{1}{\sqrt{2\pi}} = \mathcal{U}(\frac{1}{2})$.
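It can be instructive to compare the lower bound of Theorem 11.52 with the halfspace value $\mathcal{U}(\alpha) = \varphi(\Phi^{-1}(\alpha))$ from Proposition 11.49. Applying Theorem 11.52 to a halfspace of volume $\alpha$ shows $2\sqrt{2/\pi}\,\alpha(1 - \alpha) \le \mathcal{U}(\alpha)$, with equality at $\alpha = \frac{1}{2}$. The sketch below tabulates both sides; the function names and the grid of volumes are illustrative assumptions.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def isoperimetric(alpha):
    """U(alpha) = phi(Phi^{-1}(alpha)) for alpha strictly between 0 and 1."""
    return STANDARD_NORMAL.pdf(STANDARD_NORMAL.inv_cdf(alpha))

def subadditivity_bound(alpha):
    """Lower bound on surf_gamma(A) from Theorem 11.52 when vol_gamma(A) = alpha."""
    return 2 * math.sqrt(2 / math.pi) * alpha * (1 - alpha)

table = [(alpha, subadditivity_bound(alpha), isoperimetric(alpha))
         for alpha in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99)]
for alpha, bound, halfspace in table:
    print(f"alpha = {alpha:4.2f}: bound = {bound:.4f}, U(alpha) = {halfspace:.4f}")
```

The two columns coincide only at $\alpha = \frac{1}{2}$, which is why the simple subadditivity argument recovers the sharp inequality in that case alone.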
As for the full Gaussian Isoperimetric Inequality, it’s a pleasing fact that
it can be derived by pure analysis of Boolean functions. This was shown by
Bobkov (Bobkov, 1997), who proved the following very interesting
isoperimetric inequality about Boolean functions:


Bobkov’s Inequality. Let $f : \{-1, 1\}^n \to [0, 1]$. Then

$$\mathcal{U}(\mathbf{E}[f]) \le \mathbf{E}_{x \sim \{-1,1\}^n}[\|(\mathcal{U}(f(x)), \nabla f(x))\|]. \tag{11.15}$$

Here $\nabla f$ is the discrete gradient (as in Definition 2.34) and $\|\cdot\|$ is the usual
Euclidean norm (in $\mathbb{R}^{n+1}$). Thus to restate the inequality,

$$\mathcal{U}(\mathbf{E}[f]) \le \mathbf{E}_{x \sim \{-1,1\}^n}\left[\sqrt{\mathcal{U}(f(x))^2 + \sum_{i=1}^{n} \mathrm{D}_i f(x)^2}\right].$$

In particular, suppose $f = 1_A$ is the 0-1 indicator of a subset $A \subseteq \{-1, 1\}^n$.
Then since $\mathcal{U}(0) = \mathcal{U}(1) = 0$ we obtain $\mathcal{U}(\mathbf{E}[1_A]) \le \mathbf{E}[\|\nabla 1_A\|]$.


As Bobkov noted, by the usual Central Limit Theorem argument one can
straightforwardly obtain inequality (11.15) in the setting of functions
$f \in L^2(\mathbb{R}^n, \gamma)$ with range $[0, 1]$, provided $f$ is sufficiently smooth (for example,
if $f$ is in the domain of $L$; see Exercise 11.18). Then given $A \subseteq \mathbb{R}^n$, by taking a
sequence of smooth approximations to $1_A$ as in Definition 11.48, the Gaussian
Isoperimetric Inequality $\mathcal{U}(\mathbf{E}[1_A]) \le \mathrm{surf}_\gamma(A)$ is recovered.

Given $A \subseteq \{-1, 1\}^n$ we can write the quantity $\mathbf{E}[\|\nabla 1_A\|]$ appearing in
Bobkov’s Inequality as

$$\mathbf{E}[\|\nabla 1_A\|] = \tfrac{1}{2} \cdot \mathbf{E}_{x \sim \{-1,1\}^n}\left[\sqrt{\mathrm{sens}_A(x)}\right], \tag{11.16}$$

using the fact that for $1_A : \{-1, 1\}^n \to \{0, 1\}$ we have

$$\mathrm{D}_i 1_A(x) = \pm\tfrac{1}{2} \cdot \mathbf{1}[\text{coordinate } i \text{ is pivotal for } 1_A \text{ on } x].$$

The quantity in (11.16), (half of) the expected square-root of the number of
pivotal coordinates, is an interesting possible notion of “Boolean surface area”
for sets $A \subseteq \{-1, 1\}^n$. It was first essentially proposed by Talagrand (Talagrand,
1993). By Cauchy-Schwarz it’s upper-bounded by (half of) the square-root of
our usual notion of boundary size, average sensitivity:

$$\mathbf{E}[\|\nabla 1_A\|] \le \sqrt{\mathbf{E}[\|\nabla 1_A\|^2]} = \sqrt{\mathbf{I}[1_A]}. \tag{11.17}$$

(Note that $\mathbf{I}[1_A]$ here is actually one quarter of the average sensitivity of $A$,
because we’re using 0-1 indicators as opposed to $\pm 1$.) But the inequality
in (11.17) is often far from sharp. For example, while the majority function has
average sensitivity $\Theta(\sqrt{n})$, the expected square-root of its sensitivity is $\Theta(1)$
because a $\Theta(1/\sqrt{n})$-fraction of strings have sensitivity $\lceil n/2 \rceil$ and the remainder
have sensitivity 0.
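For small odd $n$ the majority example can be checked by brute force. The sketch below enumerates $\{-1, 1\}^n$, computes the sensitivity at each point, and compares the expected square-root of the sensitivity with the average sensitivity; it also checks the set form of Bobkov’s Inequality, $\mathcal{U}(\frac{1}{2}) \le \frac{1}{2} \mathbf{E}[\sqrt{\mathrm{sens}_A(x)}]$, using $\mathcal{U}(\frac{1}{2}) = \frac{1}{\sqrt{2\pi}}$. The function names and the range of $n$ are illustrative assumptions.

```python
import math
from itertools import product

def sensitivity(f, x):
    """Number of coordinates i such that flipping x_i changes f(x)."""
    count = 0
    for i in range(len(x)):
        flipped = x[:i] + (-x[i],) + x[i + 1:]
        if f(flipped) != f(x):
            count += 1
    return count

def majority(x):
    """0-1 indicator of the set of strings with positive sum (n odd)."""
    return 1 if sum(x) > 0 else 0

def boundary_statistics(f, n):
    """Return (E[sqrt(sens)], E[sens]) over uniform x in {-1,1}^n."""
    points = list(product((-1, 1), repeat=n))
    sens = [sensitivity(f, x) for x in points]
    return (sum(math.sqrt(s) for s in sens) / len(points),
            sum(sens) / len(points))

results = {n: boundary_statistics(majority, n) for n in (3, 5, 7, 9)}
for n, (root_mean, mean) in results.items():
    print(f"n = {n}: E[sqrt(sens)] = {root_mean:.4f}, E[sens] = {mean:.4f}")
```

As $n$ grows in this range, the average sensitivity keeps increasing while the expected square-root stays close to 1.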

Let’s turn to the proof of Bobkov’s Inequality. As you are asked to show
in Exercise 11.20, the general-$n$ case of Bobkov’s Inequality follows from the
$n = 1$ case by a straightforward “induction by restrictions”. Thus just as in
the proof of the Hypercontractivity Theorem, it suffices to prove the $n = 1$
“two-point inequality”, an elementary inequality about two real numbers:


Bobkov’s Two-Point Inequality. Let $f : \{-1, 1\} \to [0, 1]$. Then

$$\mathcal{U}(\mathbf{E}[f]) \le \mathbf{E}[\|(\mathcal{U}(f), \nabla f)\|].$$

Writing $f(x) = a + bx$, this is equivalent to saying that provided $a \pm b \in [0, 1]$,

$$\mathcal{U}(a) \le \tfrac{1}{2} \|(\mathcal{U}(a + b), b)\| + \tfrac{1}{2} \|(\mathcal{U}(a - b), b)\|.$$
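Before turning to the proof, the two-point inequality can be spot-checked on a grid of pairs $(a, b)$ with $a \pm b \in [0, 1]$. The sketch below uses $\mathcal{U} = \varphi \circ \Phi^{-1}$ with $\mathcal{U}(0) = \mathcal{U}(1) = 0$; the function names and the grid resolution are illustrative assumptions, and a grid check is of course no substitute for the proof.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def isoperimetric(alpha):
    """U(alpha) = phi(Phi^{-1}(alpha)), extended by U(0) = U(1) = 0."""
    if alpha <= 0.0 or alpha >= 1.0:
        return 0.0
    return STANDARD_NORMAL.pdf(STANDARD_NORMAL.inv_cdf(alpha))

def two_point_gap(a, b):
    """Right-hand side minus left-hand side of Bobkov's Two-Point Inequality."""
    rhs = (0.5 * math.hypot(isoperimetric(a + b), b)
           + 0.5 * math.hypot(isoperimetric(a - b), b))
    return rhs - isoperimetric(a)

def smallest_gap(steps=50):
    """Minimum gap over the grid a = i/steps, b = j/steps with 0 <= a - b and a + b <= 1."""
    gaps = []
    for i in range(steps + 1):
        for j in range(steps + 1):
            if i - j >= 0 and i + j <= steps:
                gaps.append(two_point_gap(i / steps, j / steps))
    return min(gaps)

minimum = smallest_gap()
print(f"smallest gap on the grid: {minimum:.3e}")
```

The gap is 0 when $b = 0$ and should be nonnegative (up to floating-point rounding) everywhere else on the grid.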


Remark 11.53. The only property of Y used in proving this inequality is that 
it satisfies (Exercise 5.43) the differential equation Y2” = —1 on (0, 1). 


Bobkov’s proof of the two-point inequality was elementary but somewhat
long and hard to motivate. In contrast, Barthe and Maurey (Barthe and Maurey,
2000) gave a fairly short proof of the inequality, but it used methods
from stochastic calculus, namely Itô’s Formula. We present here an elementary
discretization of the Barthe–Maurey proof.


Proof of Bobkov’s Two-Point Inequality. By symmetry and continuity we may
assume $\delta \le a - b < a + b \le 1 - \delta$ for some $\delta > 0$. Let $\tau = \tau(\delta) > 0$ be a
small quantity to be chosen later such that $b/\tau$ is an integer. Let $y_0, y_1, y_2, \dots$
be a random walk within $[a-b, a+b]$ that starts at $y_0 = a$, takes independent
equally likely steps of $\pm\tau$, and is absorbed at the endpoints $a \pm b$. Finally, for
$t \in \mathbb{N}$, define $z_t = \|(\mathcal{U}(y_t), \tau\sqrt{t})\|$. The key claim for the proof is:


Claim 11.54. Assuming $\tau = \tau(\delta) > 0$ is small enough, $(z_t)_t$ is a submartingale
with respect to $(y_t)_t$, i.e., $\mathbf{E}[z_{t+1} \mid y_0, \dots, y_t] = \mathbf{E}[z_{t+1} \mid y_t] \ge z_t$.


Let’s complete the proof given the claim. Let $T$ be the stopping time at
which $y_t$ first reaches $a \pm b$. By the Optional Stopping Theorem we have
$\mathbf{E}[z_0] \le \mathbf{E}[z_T]$; i.e.,

\[
  \mathcal{U}(a) \le \mathbf{E}[\|(\mathcal{U}(y_T), \tau\sqrt{T})\|]. \tag{11.18}
\]

In the expectation above we can condition on whether the walk stopped at
$a + b$ or $a - b$. By symmetry, both events occur with probability 1/2 and
neither changes the conditional distribution of $T$. Thus we get

\begin{align*}
  \mathcal{U}(a) &\le \tfrac12 \mathbf{E}[\|(\mathcal{U}(a+b), \tau\sqrt{T})\|] + \tfrac12 \mathbf{E}[\|(\mathcal{U}(a-b), \tau\sqrt{T})\|] \\
  &\le \tfrac12 \bigl\|\bigl(\mathcal{U}(a+b), \sqrt{\mathbf{E}[\tau^2 T]}\bigr)\bigr\| + \tfrac12 \bigl\|\bigl(\mathcal{U}(a-b), \sqrt{\mathbf{E}[\tau^2 T]}\bigr)\bigr\|,
\end{align*}

with the second inequality using concavity of $v \mapsto \sqrt{u^2 + v}$. But it’s a well-known
fact (following immediately from Exercise 11.22) that $\mathbf{E}[T] = (b/\tau)^2$, so that
$\sqrt{\mathbf{E}[\tau^2 T]} = b$. Substituting this into the above completes the proof.
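
The identity $\mathbf{E}[T] = (b/\tau)^2$ can be checked exactly for small cases. In the sketch below we rescale so that the walk takes $\pm 1$ steps from $0$ and is absorbed at $\pm m$, where $m = b/\tau$; the function name and the choice of a tridiagonal (Thomas) elimination over the rationals are ours, not the book’s.

```python
from fractions import Fraction

def expected_absorption_time(m):
    """Exact E[T] for a +/-1 walk started at 0 and absorbed at +/-m.

    Solves E_k = 1 + (E_{k-1} + E_{k+1}) / 2 with E_{-m} = E_m = 0
    by tridiagonal (Thomas) elimination with exact rational arithmetic.
    """
    size = 2 * m - 1                      # interior positions -m+1, ..., m-1
    half = Fraction(1, 2)
    c = [Fraction(0)] * size              # modified super-diagonal
    d = [Fraction(0)] * size              # modified right-hand side
    c[0], d[0] = -half, Fraction(1)
    for i in range(1, size):
        denom = 1 + half * c[i - 1]       # pivot after eliminating the sub-diagonal
        c[i] = -half / denom
        d[i] = (1 + half * d[i - 1]) / denom
    e = [Fraction(0)] * size
    e[-1] = d[-1]
    for i in range(size - 2, -1, -1):     # back substitution
        e[i] = d[i] - c[i] * e[i + 1]
    return e[m - 1]                       # index of the starting position 0

print(expected_absorption_time(10))
```

The solution of the recurrence is $E_k = m^2 - k^2$, so the value at the starting point is $m^2$.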

It remains to verify Claim 11.54. Actually, although the claim is true as
stated (see Exercise 11.23) it will be more natural to prove the following
slightly weaker claim:

\[
  \mathbf{E}[z_{t+1} \mid y_t] \ge z_t - C_\delta \tau^3 \tag{11.19}
\]

for some constant $C_\delta$ depending only on $\delta$. This is still enough to complete
the proof: Applying the Optional Stopping Theorem to the submartingale
$(z_t + C_\delta \tau^3 t)_t$ we get that (11.18) holds up to an additive $C_\delta \tau^3 \mathbf{E}[T] =
C_\delta b^2 \tau$. Then continuing with the above we deduce Bobkov’s Inequality up
to $C_\delta b^2 \tau$, and we can make $\tau$ arbitrarily small.

Even though we only need to prove (11.19), let’s begin a proof of the original
Claim 11.54 anyway. Fix $t \in \mathbb{N}^+$ and condition on $y_t = y$. If $y$ is $a \pm b$, then
the walk is stopped and the claim is clear. Otherwise, $y_{t+1}$ is $y \pm \tau$ with equal
probability, and we want to verify the following inequality (assuming $\tau > 0$ is
sufficiently small as a function of $\delta$, independent of $y$):

\begin{align*}
  \|(\mathcal{U}(y), \tau\sqrt{t})\|
  &\le \tfrac12 \|(\mathcal{U}(y+\tau), \tau\sqrt{t+1})\| + \tfrac12 \|(\mathcal{U}(y-\tau), \tau\sqrt{t+1})\| \tag{11.20} \\
  &= \tfrac12 \bigl\|\bigl(\sqrt{\mathcal{U}(y+\tau)^2 + \tau^2}, \tau\sqrt{t}\bigr)\bigr\| + \tfrac12 \bigl\|\bigl(\sqrt{\mathcal{U}(y-\tau)^2 + \tau^2}, \tau\sqrt{t}\bigr)\bigr\|.
\end{align*}

By the triangle inequality, it’s sufficient to show

\[
  \mathcal{U}(y) \le \tfrac12 \sqrt{\mathcal{U}(y+\tau)^2 + \tau^2} + \tfrac12 \sqrt{\mathcal{U}(y-\tau)^2 + \tau^2},
\]

and this is actually necessary too, being the $t = 0$ case of (11.20). (In fact,
this is identical to Bobkov’s Two-Point Inequality itself, except now we may
assume $\tau$ is sufficiently small.) Finally, since we actually only need the weakened
submartingale statement (11.19), we’ll instead establish

\[
  \mathcal{U}(y) - C_\delta \tau^3 \le \tfrac12 \sqrt{\mathcal{U}(y+\tau)^2 + \tau^2} + \tfrac12 \sqrt{\mathcal{U}(y-\tau)^2 + \tau^2} \tag{11.21}
\]

for some constant $C_\delta$ depending only on $\delta$ and for every $\tau \le \delta/2$. We do this
using Taylor’s theorem. Write $V_y(\tau)$ for the function of $\tau$ on the right-hand
side of (11.21). For any $y \in [a-b, a+b]$ the function $V_y$ is smooth on $[0, \delta/2]$
because $\mathcal{U}$ is a smooth, positive function on $[\delta/2, 1 - \delta/2]$. Thus

\[
  V_y(\tau) = V_y(0) + V_y'(0)\tau + \tfrac12 V_y''(0)\tau^2 + \tfrac16 V_y'''(\xi)\tau^3
\]

for some $\xi$ between 0 and $\tau$. The magnitude of $V_y'''(\xi)$ is indeed bounded by
some $C_\delta$ depending only on $\delta$, using the fact that $\mathcal{U}$ is smooth and positive on
$[\delta/2, 1 - \delta/2]$. But $V_y(0) = \mathcal{U}(y)$, and it’s straightforward to calculate that

\[
  V_y'(0) = 0, \qquad V_y''(0) = \mathcal{U}''(y) + 1/\mathcal{U}(y) = 0;
\]

the last identity used the key property $\mathcal{U}'' = -1/\mathcal{U}$ mentioned in Remark 11.53.
Thus we conclude $V_y(\tau) \ge \mathcal{U}(y) - C_\delta \tau^3$, verifying (11.21) and completing the
proof.
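
The key property $\mathcal{U}\mathcal{U}'' = -1$ used in the last step can be spot-checked numerically. This is a rough sketch only: it again assumes $\mathcal{U} = \varphi \circ \Phi^{-1}$, estimates $\mathcal{U}''$ by a central finite difference, and uses a step size $h$ that we picked by hand.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard Gaussian N(0, 1)

def U(p):
    """Gaussian isoperimetric function phi(Phi^{-1}(p)) for 0 < p < 1."""
    return _STD.pdf(_STD.inv_cdf(p))

def ode_residual(p, h=1e-4):
    """U(p) * U''(p) + 1, with U'' estimated by a central difference."""
    second = (U(p + h) - 2.0 * U(p) + U(p - h)) / (h * h)
    return U(p) * second + 1.0

for p in (0.1, 0.3, 0.5, 0.8):
    print(p, ode_residual(p))
```

The residuals should be close to zero, limited only by the finite-difference and rounding error.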


As a matter of fact, by a minor adjustment (Exercise 11.24) to this ran- 
dom walk argument we can establish the following generalization of Bobkov’s 
Inequality: 


Theorem 11.55. Let $f : \{-1,1\}^n \to [0,1]$. Then $\mathbf{E}[\|(\mathcal{U}(\mathrm{T}_\rho f), \nabla \mathrm{T}_\rho f)\|]$ is
an increasing function of $\rho \in [0,1]$. We recover Bobkov’s Inequality by
considering $\rho = 0, 1$.


We end this section by remarking that De, Mossel, and Neeman (De et al., 
2013) have given a “Bobkov-style” Boolean inductive proof that yields both 
Borell’s Isoperimetric Theorem and also the Majority Is Stablest Theorem 
(albeit with some aspects of the Invariance Principle-based proof appearing in 
the latter case); see Exercise 11.30 and the notes at the end of this chapter. 


11.5. The Berry—Esseen Theorem 


Now that we’ve built up some results concerning Gaussian space, we’re moti- 
vated to try reducing problems involving Boolean functions to problems involv- 
ing Gaussian functions. The key tool for this is the Invariance Principle, dis- 
cussed at the beginning of the chapter. As a warmup, this section is devoted to 
proving (a form of) the Berry–Esseen Theorem. As discussed in Chapter 5.2,
the Berry—Esseen Theorem is a quantitative form of the Central Limit Theorem 
for finite sums of independent random variables. We restate it here: 


Berry–Esseen Theorem. Let $X_1, \dots, X_n$ be independent random variables
with $\mathbf{E}[X_i] = 0$ and $\mathbf{Var}[X_i] = \sigma_i^2$, and assume $\sum_{i=1}^n \sigma_i^2 = 1$. Let
$S = \sum_{i=1}^n X_i$ and let $Z \sim \mathrm{N}(0,1)$ be a standard Gaussian. Then for all
$u \in \mathbb{R}$,

\[
  |\mathbf{Pr}[S \le u] - \mathbf{Pr}[Z \le u]| \le c\gamma,
\]

where

\[
  \gamma = \sum_{i=1}^n \|X_i\|_3^3
\]

and c is a universal constant. (For definiteness, $c = .56$ is acceptable.)
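
For a concrete feel for the statement, one can compute the cdf-distance exactly when each $X_i$ is a uniform random sign scaled by $1/\sqrt{n}$, in which case $\gamma = 1/\sqrt{n}$. The sketch below is an illustration under that specific assumption; the function names are ours.

```python
from math import comb, sqrt
from statistics import NormalDist

def cdf_distance_sign_sum(n):
    """sup_u |Pr[S <= u] - Pr[Z <= u]| for S = (x_1 + ... + x_n)/sqrt(n), x_i uniform +/-1."""
    phi = NormalDist().cdf
    worst, below = 0.0, 0.0          # below = Pr[S < current atom]
    for ones in range(n + 1):
        atom = (2 * ones - n) / sqrt(n)
        mass = comb(n, ones) / 2 ** n
        # Between atoms the cdf of S is flat, so the supremum is approached
        # just to the left of an atom or attained at the atom itself.
        worst = max(worst, abs(below - phi(atom)), abs(below + mass - phi(atom)))
        below += mass
    return worst

def gamma_sign_sum(n):
    """gamma = sum_i E|X_i|^3 for X_i = x_i / sqrt(n), i.e. n * n^(-3/2)."""
    return n * (1 / sqrt(n)) ** 3

for n in (1, 9, 25, 100):
    print(n, cdf_distance_sign_sum(n), 0.56 * gamma_sign_sum(n))
```

In each case the exact distance sits below $.56\gamma$, consistent with the theorem.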


In this traditional statement of Berry—Esseen, the error term y is a little 
opaque. To say that y is small is to simultaneously say two things: the random 
variables X; are all “reasonable” (as in Chapter 9.1); and, none is too dominant 
in terms of variance. In Chapter 9.1 we discussed several related notions of 
“reasonableness” for a random variable X. It was convenient there to use 
the definition that $\|X\|_4^4$ is not much larger than $\|X\|_2^4$. For the Berry–Esseen
Theorem it’s more convenient (and slightly stronger) to use the analogous
condition for the 3rd moment. (For the Invariance Principle it will be more
convenient to use $(2,3,\rho)$- or $(2,4,\rho)$-hypercontractivity.) The implication for
Berry—Esseen is the following: 


Remark 11.56. In the Berry–Esseen Theorem, if all of the $X_i$’s are “reasonable”
in the sense that $\|X_i\|_3^3 \le B\|X_i\|_2^3 = B\sigma_i^3$, then we can use the bound

\[
  \gamma \le B \cdot \max_i\{\sigma_i\}, \tag{11.22}
\]

as this is a consequence of

\[
  \gamma = \sum_{i=1}^n \|X_i\|_3^3 \le B \sum_{i=1}^n \sigma_i^3 \le B \cdot \max_i\{\sigma_i\} \cdot \sum_{i=1}^n \sigma_i^2 = B \cdot \max_i\{\sigma_i\}.
\]

(Cf. Remark 5.15.) Note that some “reasonableness” condition must hold if
$S = \sum_i X_i$ is to behave like a Gaussian. For example, if each $X_i$ is the “unreasonable”
random variable which is $\pm\sqrt{n}$ with probability $\frac{1}{2n^2}$ each and 0 otherwise,
then $S = 0$ except with probability at most $\frac1n$, quite unlike a Gaussian.


Further, even assuming reasonableness we still need a condition like (11.22)
ensuring that no $X_i$ is too dominant (“influential”) in terms of variance. For
example, if $X_1 \sim \{-1,1\}$ is a uniformly random bit and $X_2, \dots, X_n \equiv 0$, then
$S = X_1$, which is again quite unlike a Gaussian.


There are several known ways to prove the Berry–Esseen Theorem; for
example, using characteristic functions (i.e., “real” Fourier analysis), or Stein’s
Method. We’ll use the “Replacement Method” (also known as the Lindeberg
Method, and similar to the “Hybrid Method” in theoretical cryptography).
Although it doesn’t always give the sharpest results, it’s a very flexible technique
which generalizes easily to higher-degree polynomials of random variables (as
in the Invariance Principle) and random vectors. The Replacement Method
suggests itself as soon as the Berry–Esseen Theorem is written in a slightly
different form: Instead of trying to show

\[
  X_1 + X_2 + \cdots + X_n \approx Z, \tag{11.23}
\]

where $Z \sim \mathrm{N}(0,1)$, we’ll instead try to show the equivalent statement

\[
  X_1 + X_2 + \cdots + X_n \approx Z_1 + Z_2 + \cdots + Z_n, \tag{11.24}
\]


where the $Z_i$’s are independent Gaussians with $Z_i \sim \mathrm{N}(0, \sigma_i^2)$. The statements
(11.23) and (11.24) really are identical, since the sum of independent
Gaussians is Gaussian, with the variances adding. The Replacement Method
proves (11.24) by replacing the $X_i$’s with $Z_i$’s one by one. Roughly speaking,
we introduce the “hybrid” random variables

\[
  H_t = Z_1 + \cdots + Z_t + X_{t+1} + \cdots + X_n,
\]

show that $H_{t-1} \approx H_t$ for each $t \in [n]$, and then simply add up the $n$ errors.
As a matter of fact, the Replacement Method doesn’t really have anything
to do with Gaussian random variables. It actually seeks to show that

\[
  X_1 + X_2 + \cdots + X_n \approx Y_1 + Y_2 + \cdots + Y_n,
\]

whenever $X_1, \dots, X_n, Y_1, \dots, Y_n$ are independent random variables with
“matching first and second moments”, meaning $\mathbf{E}[X_i] = \mathbf{E}[Y_i]$ and $\mathbf{E}[X_i^2] =
\mathbf{E}[Y_i^2]$ for each $i \in [n]$. (The error will be proportional to $\sum_i (\|X_i\|_3^3 + \|Y_i\|_3^3)$.)
Another way of putting it (roughly speaking) is that the linear form $x_1 + \cdots +
x_n$ is invariant to what independent random variables you substitute in for
$x_1, \dots, x_n$, so long as you always use the same first and second moments. The
fact that we can take the $Y_i$’s to be Gaussians (with $Y_i \sim \mathrm{N}(\mathbf{E}[X_i], \mathbf{Var}[X_i])$)
and then in the end use the fact that the sum of Gaussians is Gaussian to derive
the simpler-looking

\[
  S = \sum_{i=1}^n X_i \approx \mathrm{N}(\mathbf{E}[S], \mathbf{Var}[S])
\]


is just a pleasant bonus (and one that we’ll no longer get once we look at
nonlinear polynomials of random variables in Section 11.6). Indeed, the remainder
of this section will be devoted to showing that

\[
  S_X = X_1 + \cdots + X_n \quad \text{is “close” to} \quad S_Y = Y_1 + \cdots + Y_n
\]

whenever the $X_i$’s and $Y_i$’s are independent, “reasonable” random variables
with matching first and second moments.

[Figure 11.1. The test functions $\psi$ used for judging $\mathbf{Pr}[S_X \le u] \approx \mathbf{Pr}[S_Y \le u]$,
$\|S_X\|_1 \approx \|S_Y\|_1$, and $\mathbf{E}[\mathrm{dist}_{[-1,1]}(S_X)] \approx \mathbf{E}[\mathrm{dist}_{[-1,1]}(S_Y)]$, respectively.
Panels: $\psi(s) = 1_{s \le u}$; $\psi(s) = |s|$; $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$.]

To do this, we’ll first have to discuss in more detail what it means for two
random variables to be “close”. A traditional measure of closeness between
two random variables $S_X$ and $S_Y$ is the “cdf-distance” used in the Berry–Esseen
Theorem: $\mathbf{Pr}[S_X \le u] \approx \mathbf{Pr}[S_Y \le u]$ for every $u \in \mathbb{R}$. But there are
other natural measures of closeness too. We might want to know that the
absolute moments of $S_X$ and $S_Y$ are close; for example, that $\|S_X\|_1 \approx \|S_Y\|_1$.
Or, we might like to know that $S_X$ and $S_Y$ stray from the interval $[-1,1]$
by about the same amount: $\mathbf{E}[\mathrm{dist}_{[-1,1]}(S_X)] \approx \mathbf{E}[\mathrm{dist}_{[-1,1]}(S_Y)]$. Here we are
using:


Definition 11.57. For any interval $\emptyset \ne I \subseteq \mathbb{R}$ the function $\mathrm{dist}_I : \mathbb{R} \to \mathbb{R}^{\ge 0}$
measures the distance of a point from $I$; i.e., $\mathrm{dist}_I(s) = \inf_{u \in I}\{|s - u|\}$.


All of the closeness measures just described can be put in a common framework:
they are requiring $\mathbf{E}[\psi(S_X)] \approx \mathbf{E}[\psi(S_Y)]$ for various “test functions”
(or “distinguishers”) $\psi : \mathbb{R} \to \mathbb{R}$.

It would be nice to prove a version of the Berry–Esseen Theorem that
showed closeness for all the test functions $\psi$ depicted in Figure 11.1, and
more. What class of tests might we be able to handle? On one hand, we can’t be
too ambitious. For example, suppose each $X_i \sim \{-1,1\}$, each $Y_i \sim \mathrm{N}(0,1)$,
and $\psi(s) = 1_{s \in \mathbb{Z}}$. Then $\mathbf{E}[\psi(S_X)] = 1$ because $S_X$ is supported on the integers,
but $\mathbf{E}[\psi(S_Y)] = 0$ because $S_Y \sim \mathrm{N}(0,n)$ is a continuous random variable. On
the other hand, there are some simple kinds of tests $\psi$ for which we have exact
equality. For example, if $\psi(s) = s$, then $\mathbf{E}[\psi(S_X)] = \mathbf{E}[\psi(S_Y)]$; this is by the
assumption of matching first moments, $\mathbf{E}[X_i] = \mathbf{E}[Y_i]$ for all $i$. Similarly, if
$\psi(s) = s^2$, then

\begin{align*}
  \mathbf{E}[\psi(S_X)] = \mathbf{E}\Bigl[\Bigl(\sum_i X_i\Bigr)^2\Bigr] &= \sum_i \mathbf{E}[X_i^2] + \sum_{i \ne j} \mathbf{E}[X_i X_j] \\
  &= \sum_i \mathbf{E}[X_i^2] + \sum_{i \ne j} \mathbf{E}[X_i]\mathbf{E}[X_j] \tag{11.25}
\end{align*}

(using independence of the $X_i$’s); similarly

\[
  \mathbf{E}[\psi(S_Y)] = \sum_i \mathbf{E}[Y_i^2] + \sum_{i \ne j} \mathbf{E}[Y_i]\mathbf{E}[Y_j]; \tag{11.26}
\]

and (11.25) and (11.26) are equal because of the matching first and second
moment conditions.
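
The exact equality for quadratic tests, and its failure beyond degree 2, can be confirmed on a toy example with exact rational arithmetic. The two distributions below are hypothetical choices of ours: both have mean 0 and variance 1, but their fourth moments differ.

```python
from fractions import Fraction
from itertools import product

# Two mean-0, variance-1 distributions, given as {value: probability}.
X_DIST = {-1: Fraction(1, 2), 1: Fraction(1, 2)}
Y_DIST = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}

def expect_of_sum(psi, dist, n):
    """Exact E[psi(S)] where S is a sum of n independent draws from dist."""
    total = Fraction(0)
    for outcome in product(dist.items(), repeat=n):
        prob = Fraction(1)
        s = 0
        for value, p in outcome:
            prob *= p
            s += value
        total += prob * psi(s)
    return total

def quadratic(s):
    return 3 - 2 * s + 5 * s * s

def quartic(s):
    return s ** 4

print(expect_of_sum(quadratic, X_DIST, 3), expect_of_sum(quadratic, Y_DIST, 3))
print(expect_of_sum(quartic, X_DIST, 3), expect_of_sum(quartic, Y_DIST, 3))
```

The quadratic test gives identical expectations for the two sums, while the quartic test separates them; this is why the error term must involve higher moments.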

As a consequence of these observations we have E[y(Sx)] = E[w(Sy)] for 
any quadratic polynomial y (s) = a + bs + cs?. This suggests that to handle a 
general test y we try to approximate it by a quadratic polynomial up to some 
error; in other words, consider its 2nd-order Taylor expansion. For this to make 
sense the function y must have a continuous 3rd derivative, and the error we 
incur will involve the magnitude of this derivative. Indeed, we will now prove a 
variant of the Berry–Esseen Theorem for the class of $\mathcal{C}^3$ test functions $\psi$ with
$\psi'''$ uniformly bounded. You might be concerned that this class doesn’t contain
any of the interesting test functions depicted in Figure 11.1. But we’ll be able 
to handle even those test functions with some loss in the parameters by using 
a simple “hack” — approximating them by smooth functions, as suggested in 
Figure 11.2. 


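Invariance Principle for Sums of Random Variables. Let $X_1, \dots, X_n$,
$Y_1, \dots, Y_n$ be independent random variables with matching 1st and 2nd
moments; i.e., $\mathbf{E}[X_i^k] = \mathbf{E}[Y_i^k]$ for $i \in [n]$, $k \in \{1, 2\}$. Write $S_X = \sum_i X_i$ and
$S_Y = \sum_i Y_i$. Then for any $\psi : \mathbb{R} \to \mathbb{R}$ with continuous third derivative,

\[
  |\mathbf{E}[\psi(S_X)] - \mathbf{E}[\psi(S_Y)]| \le \tfrac16 \|\psi'''\|_\infty \cdot \gamma_{XY},
\]

where $\gamma_{XY} = \sum_i (\|X_i\|_3^3 + \|Y_i\|_3^3)$.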


[Figure 11.2. The step function $\psi(s) = 1_{s \le u}$ can be smoothed out on the
interval $[u - \eta, u + \eta]$ so that the resulting function $\widetilde{\psi}_\eta$ satisfies $\|\widetilde{\psi}_\eta'''\|_\infty \le
O(1/\eta^3)$. Similarly, we can smooth out $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$ to a function $\widetilde{\psi}_\eta$
satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \le \eta$ and $\|\widetilde{\psi}_\eta'''\|_\infty \le O(1/\eta^2)$.]


Proof. The proof is by the Replacement Method. For $0 \le t \le n$, define the
“hybrid” random variable

\[
  H_t = Y_1 + \cdots + Y_t + X_{t+1} + \cdots + X_n,
\]

so $S_X = H_0$ and $S_Y = H_n$. Thus by the triangle inequality,

\[
  |\mathbf{E}[\psi(S_X)] - \mathbf{E}[\psi(S_Y)]| \le \sum_{t=1}^n |\mathbf{E}[\psi(H_{t-1})] - \mathbf{E}[\psi(H_t)]|.
\]


Given the definition of $\gamma_{XY}$, we can complete the proof by showing that for
each $t \in [n]$,

\begin{align*}
  \tfrac16 \|\psi'''\|_\infty \cdot (\mathbf{E}[|X_t|^3] + \mathbf{E}[|Y_t|^3]) &\ge |\mathbf{E}[\psi(H_{t-1})] - \mathbf{E}[\psi(H_t)]| \\
  &= |\mathbf{E}[\psi(H_{t-1}) - \psi(H_t)]| \\
  &= |\mathbf{E}[\psi(U_t + X_t) - \psi(U_t + Y_t)]|, \tag{11.27}
\end{align*}

where

\[
  U_t = Y_1 + \cdots + Y_{t-1} + X_{t+1} + \cdots + X_n.
\]

Note that $U_t$ is independent of $X_t$ and $Y_t$. We are now comparing $\psi$’s values
at $U_t + X_t$ and $U_t + Y_t$, with the presumption that $X_t$ and $Y_t$ are rather small
compared to $U_t$. This clearly suggests the use of Taylor’s theorem: For all
$u, \delta \in \mathbb{R}$,

\[
  \psi(u + \delta) = \psi(u) + \psi'(u)\delta + \tfrac12 \psi''(u)\delta^2 + \tfrac16 \psi'''(u^*)\delta^3,
\]

for some $u^* = u^*(u, \delta)$ between $u$ and $u + \delta$. Applying this pointwise with
$u = U_t$, $\delta = X_t, Y_t$ yields

\begin{align*}
  \psi(U_t + X_t) &= \psi(U_t) + \psi'(U_t)X_t + \tfrac12 \psi''(U_t)X_t^2 + \tfrac16 \psi'''(U_t^*)X_t^3, \\
  \psi(U_t + Y_t) &= \psi(U_t) + \psi'(U_t)Y_t + \tfrac12 \psi''(U_t)Y_t^2 + \tfrac16 \psi'''(U_t^{**})Y_t^3
\end{align*}

for some random variables $U_t^*$, $U_t^{**}$. Referring back to our goal of (11.27),
what happens when we subtract these two identities and take expectations?
The $\psi(U_t)$ terms cancel. The next difference is

\[
  \mathbf{E}[\psi'(U_t)(X_t - Y_t)] = \mathbf{E}[\psi'(U_t)] \cdot \mathbf{E}[X_t - Y_t] = \mathbf{E}[\psi'(U_t)] \cdot 0 = 0,
\]

where the first equality used that $U_t$ is independent of $X_t$ and $Y_t$, and the second
equality used the matching 1st moments of $X_t$ and $Y_t$. An identical argument,
using matching 2nd moments, shows that the difference of the
quadratic terms disappears in expectation. Thus we’re left only with the “error
term”:

\begin{align*}
  |\mathbf{E}[\psi(U_t + X_t) - \psi(U_t + Y_t)]| &= \tfrac16 |\mathbf{E}[\psi'''(U_t^*)X_t^3 - \psi'''(U_t^{**})Y_t^3]| \\
  &\le \tfrac16 \|\psi'''\|_\infty \cdot (\mathbf{E}[|X_t|^3] + \mathbf{E}[|Y_t|^3]),
\end{align*}

where the last step used the triangle inequality. This confirms (11.27) and
completes the proof.


We can now give a Berry—Esseen-type corollary by taking the Y;’s to be 
Gaussians: 


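Variant Berry–Esseen Theorem. In the setting of the Berry–Esseen Theorem,
for all $\mathcal{C}^3$ functions $\psi : \mathbb{R} \to \mathbb{R}$,

\[
  |\mathbf{E}[\psi(S)] - \mathbf{E}[\psi(Z)]| \le \tfrac16 \bigl(1 + 2\sqrt{2/\pi}\bigr) \|\psi'''\|_\infty \cdot \gamma \le .433 \|\psi'''\|_\infty \cdot \gamma.
\]


Proof. Applying the preceding theorem with $Y_i \sim \mathrm{N}(0, \sigma_i^2)$ (and hence $S_Y \sim
\mathrm{N}(0,1)$), it suffices to show that

\[
  \gamma_{XY} = \sum_{i=1}^n (\|X_i\|_3^3 + \|Y_i\|_3^3) \le \bigl(1 + 2\sqrt{2/\pi}\bigr) \cdot \gamma = \bigl(1 + 2\sqrt{2/\pi}\bigr) \cdot \sum_{i=1}^n \|X_i\|_3^3. \tag{11.28}
\]

In particular, we just need to show that $\|Y_i\|_3^3 \le 2\sqrt{2/\pi}\,\|X_i\|_3^3$ for each $i$. This
holds because Gaussians are extremely reasonable; by explicitly computing 3rd
absolute moments we indeed obtain

\[
  \|Y_i\|_3^3 = \sigma_i^3 \|\mathrm{N}(0,1)\|_3^3 = 2\sqrt{2/\pi}\,\sigma_i^3 = 2\sqrt{2/\pi}\,\|X_i\|_2^3 \le 2\sqrt{2/\pi}\,\|X_i\|_3^3. \qquad \Box
\]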


This version of the Berry—Esseen Theorem is incomparable with the standard 
version. Sometimes it can be stronger; for example, if for some reason we 
wanted to show $\mathbf{E}[\cos S] \approx \mathbf{E}[\cos Z]$ then the Variant Berry–Esseen Theorem
gives this with error $.433\gamma$, whereas it can’t be directly deduced from the
standard Berry–Esseen at all. On the other hand, as we’ll see shortly, we can
only obtain the standard Berry–Esseen conclusion from the Variant version
with an error bound of $O(\gamma^{1/4})$ rather than $O(\gamma)$.
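
The cosine example can be made fully explicit when the $X_i$ are uniform random signs scaled by $1/\sqrt{n}$ (an assumption of this sketch, not of the theorem). Then $\mathbf{E}[\cos S] = \cos(1/\sqrt{n})^n$, $\mathbf{E}[\cos Z] = e^{-1/2}$, $\|\cos'''\|_\infty = 1$, and $\gamma = 1/\sqrt{n}$; the function names below are ours.

```python
from math import cos, exp, sqrt

def cos_gap(n):
    """|E[cos S] - E[cos Z]| for S = (x_1 + ... + x_n)/sqrt(n), x_i uniform +/-1.

    Uses E[cos S] = cos(1/sqrt(n))^n (the real part of the characteristic
    function of S at 1) and E[cos Z] = exp(-1/2) for Z ~ N(0, 1).
    """
    return abs(cos(1 / sqrt(n)) ** n - exp(-0.5))

def variant_bound(n):
    """.433 * ||psi'''||_inf * gamma, with psi = cos and gamma = 1/sqrt(n)."""
    return 0.433 / sqrt(n)

for n in (1, 2, 10, 1000):
    print(n, cos_gap(n), variant_bound(n))
```

In this particular example the true gap is far smaller than the bound, which is not surprising since the Replacement Method bound must hold for every admissible test function.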

We end this section by describing the “hacks” which let us extend the Variant 
Berry–Esseen Theorem to cover certain non-$\mathcal{C}^3$ tests $\psi$. As mentioned the idea
is to smooth them out, or “mollify” them: 


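Proposition 11.58. Let $\psi : \mathbb{R} \to \mathbb{R}$ be c-Lipschitz. Then for any $\eta > 0$ there
exists $\widetilde{\psi}_\eta : \mathbb{R} \to \mathbb{R}$ satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \le c\eta$ and $\|\widetilde{\psi}_\eta^{(k)}\|_\infty \le C_k c/\eta^{k-1}$ for
each $k \in \mathbb{N}^+$. Here $C_k$ is a constant depending only on $k$, and $\widetilde{\psi}_\eta^{(k)}$ denotes the
$k$th derivative of $\widetilde{\psi}_\eta$.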


The proof is straightforward, taking $\widetilde{\psi}_\eta(s) = \mathbf{E}_{g \sim \mathrm{N}(0,1)}[\psi(s + \eta g)]$; see
Exercise 11.38.
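
For the 1-Lipschitz test $\psi(s) = |s|$ the Gaussian smoothing has a closed form (the mean of a folded normal), which makes the bound $\|\psi - \widetilde{\psi}_\eta\|_\infty \le c\eta$ easy to inspect. The sketch below checks it on a grid of our choosing; it illustrates the first claim of the proposition only, not the derivative bounds.

```python
from math import pi, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard Gaussian N(0, 1)

def smoothed_abs(s, eta):
    """E[|s + eta*g|] for g ~ N(0, 1): the mean of a folded normal."""
    z = s / eta
    return s * (2.0 * _STD.cdf(z) - 1.0) + 2.0 * eta * _STD.pdf(z)

def sup_gap(eta, lo=-3.0, hi=3.0, steps=600):
    """Largest |smoothed_abs(s, eta) - |s|| over a uniform grid on [lo, hi]."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(abs(smoothed_abs(s, eta) - abs(s)) for s in grid)

for eta in (0.5, 0.1, 0.01):
    print(eta, sup_gap(eta))
```

The largest gap occurs at $s = 0$ and equals $\eta\sqrt{2/\pi}$, comfortably within $c\eta$ for $c = 1$.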
As $\eta \to 0$ this gives a better and better smooth approximation to $\psi$, but also
a larger and larger value of $\|\widetilde{\psi}_\eta'''\|_\infty$. Trading these off gives the following:


Corollary 11.59. In the setting of the Invariance Principle for Sums of Random
Variables, if we merely have that $\psi : \mathbb{R} \to \mathbb{R}$ is c-Lipschitz, then

\[
  |\mathbf{E}[\psi(S_X)] - \mathbf{E}[\psi(S_Y)]| \le O(c) \cdot \gamma_{XY}^{1/3}.
\]


Proof. Applying the Invariance Principle for Sums of Random Variables with
the test $\widetilde{\psi}_\eta$ from Proposition 11.58 we get

\[
  |\mathbf{E}[\widetilde{\psi}_\eta(S_X)] - \mathbf{E}[\widetilde{\psi}_\eta(S_Y)]| \le O(c/\eta^2) \cdot \gamma_{XY}.
\]

But $\|\widetilde{\psi}_\eta - \psi\|_\infty \le c\eta$ implies

\[
  |\mathbf{E}[\widetilde{\psi}_\eta(S_X)] - \mathbf{E}[\psi(S_X)]| \le \mathbf{E}[|\widetilde{\psi}_\eta(S_X) - \psi(S_X)|] \le c\eta
\]

and similarly for $S_Y$. Thus we get

\[
  |\mathbf{E}[\psi(S_X)] - \mathbf{E}[\psi(S_Y)]| \le O(c) \cdot (\eta + \gamma_{XY}/\eta^2),
\]

which yields the desired bound by taking $\eta = \gamma_{XY}^{1/3}$. $\Box$

Remark 11.60. It’s obvious that the dependence on c in this theorem should
be linear in c; in fact, since we can always divide $\psi$ by c it would have sufficed
to prove the theorem assuming c = 1.


This corollary covers all Lipschitz tests, which suffices for the functions
$\psi(s) = |s|$ and $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$ from Figure 11.1. However, it still isn’t
enough for the test $\psi(s) = 1_{s \le u}$, i.e., for establishing cdf-closeness as in
the usual Berry–Esseen Theorem. Of course, we can’t hope for a smooth
approximator $\widetilde{\psi}_\eta$ satisfying $|\widetilde{\psi}_\eta(s) - 1_{s \le u}| \le \eta$ for all $s$ because of the
discontinuity at $u$. However, as suggested in Figure 11.2, if we’re willing to
exclude $s \in [u - \eta, u + \eta]$ we can get an approximator with third derivative
bound $O(1/\eta^3)$, and thereby obtain (Exercises 11.41, 11.42):


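Corollary 11.61. In the setting of the Invariance Principle for Sums of Random
Variables, for all $u \in \mathbb{R}$ we have

\[
  \mathbf{Pr}[S_Y \le u - \epsilon] - \epsilon \le \mathbf{Pr}[S_X \le u] \le \mathbf{Pr}[S_Y \le u + \epsilon] + \epsilon
\]

for $\epsilon = O(\gamma_{XY}^{1/4})$; i.e., $S_X$ and $S_Y$ have Lévy distance $d_L(S_X, S_Y) \le O(\gamma_{XY}^{1/4})$.

Finally, in the Berry–Esseen setting where $S_Y \sim \mathrm{N}(0,1)$, we can appeal to
the “anticoncentration” of Gaussians:

\begin{align*}
  \mathbf{Pr}[\mathrm{N}(0,1) \le u + \epsilon] &= \mathbf{Pr}[\mathrm{N}(0,1) \le u] + \mathbf{Pr}[u < \mathrm{N}(0,1) \le u + \epsilon] \\
  &\le \mathbf{Pr}[\mathrm{N}(0,1) \le u] + \tfrac{1}{\sqrt{2\pi}}\epsilon,
\end{align*}

and similarly for $\mathbf{Pr}[\mathrm{N}(0,1) \le u - \epsilon]$. This lets us convert the Lévy distance
bound into a cdf-distance bound. Recalling (11.28), we immediately deduce
the following weaker version of the classical Berry–Esseen Theorem:


Corollary 11.62. In the setting of the Berry–Esseen Theorem, for all $u \in \mathbb{R}$,

\[
  |\mathbf{Pr}[S \le u] - \mathbf{Pr}[Z \le u]| \le O(\gamma^{1/4}),
\]

where the $O(\cdot)$ hides a universal constant.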


Although the error bound here is weaker than necessary by a power of 1/4, this 
weakness will be more than made up for by the ease with which the Replacement 
Method generalizes to other settings. In the next section we’ll see it applied 
to nonlinear polynomials of independent random variables. Exercise 11.46 
outlines how to use it to give a Berry—Esseen theorem for sums of independent 
random vectors; as you’ll see, other than replacing Taylor’s theorem with its 
multivariate form, hardly a symbol in the proof changes. 
11.6. The Invariance Principle


Let’s summarize the Variant Berry–Esseen Theorem and proof from the
preceding section, using slightly different notation. (Specifically, we’ll
rewrite $X_i = a_i x_i$ where $\mathbf{Var}[x_i] = 1$, so $a_i = \pm\sigma_i$.) We showed that if
$x_1, \dots, x_n, y_1, \dots, y_n$ are independent mean-0, variance-1 random variables,
reasonable in the sense of having third absolute moment at most $B$, and if
$a_1, \dots, a_n$ are real constants assumed for normalization to satisfy $\sum_i a_i^2 = 1$,
then

\[
  a_1 x_1 + \cdots + a_n x_n \approx a_1 y_1 + \cdots + a_n y_n,
\]

with error bound proportional to $B \max\{|a_i|\}$.


We think of this as saying that the linear form $a_1 x_1 + \cdots + a_n x_n$ is (roughly)
invariant to what independent mean-0, variance-1, reasonable random variables
are substituted for the $x_i$’s, so long as all $|a_i|$’s are “small” (compared to
the overall variance). In this section we generalize this statement to degree-$k$
multilinear polynomial forms, $\sum_{|S| \le k} a_S x^S$. The appropriate generalization
of the condition that “all $|a_i|$’s are small” is the condition that all “influences”
$\sum_{S \ni i} a_S^2$ are small. We refer to these nonlinear generalizations of Berry–Esseen
as Invariance Principles.

In this section we'll develop the most basic Invariance Principle, which 
involves replacing bits by Gaussians for a single Boolean function f. We’ll 
show that this doesn’t change the distribution of f much provided f has small 
influences and provided that f is of “constant degree” — or at least, provided f is 
uniformly noise-stable so that it’s “close to having constant degree”. Invariance 
Principles in much more general settings are possible — for example Exer- 
cises 11.48 and 11.49 describe variants which handle several functions applied 
to correlated inputs, and functions on general product spaces. Here we’ll just 
focus on the simplest possible Invariance Principle, which is already sufficient 
for the proof of the Majority Is Stablest Theorem in Section 11.7. 

Let’s begin with some notation. 


Definition 11.63. Let $F$ be a formal multilinear polynomial over the sequence
of indeterminates $x = (x_1, \dots, x_n)$:

\[
  F(x) = \sum_{S \subseteq [n]} \widehat{F}(S) \prod_{i \in S} x_i,
\]

where the coefficients $\widehat{F}(S)$ are real numbers. We introduce the notation

\[
  \mathbf{Var}[F] = \sum_{S \ne \emptyset} \widehat{F}(S)^2, \qquad \mathbf{Inf}_i[F] = \sum_{S \ni i} \widehat{F}(S)^2.
\]

Remark 11.64. To justify this notation, we remark that we’ll always consider $F$
applied to a sequence $z = (z_1, \dots, z_n)$ of independent random variables satisfying
$\mathbf{E}[z_i] = 0$, $\mathbf{E}[z_i^2] = 1$. Under these circumstances the collection of monomial
random variables $\prod_{i \in S} z_i$ is orthonormal and so it’s easy to see (cf. Section 8.2)
that

\[
  \mathbf{E}[F(z)] = \widehat{F}(\emptyset), \qquad \mathbf{E}[F(z)^2] = \sum_{S \subseteq [n]} \widehat{F}(S)^2,
  \qquad \mathbf{Var}[F(z)] = \mathbf{Var}[F] = \sum_{S \ne \emptyset} \widehat{F}(S)^2.
\]

We also have $\mathbf{E}[\mathbf{Var}_{z_i}[F(z)]] = \mathbf{Inf}_i[F] = \sum_{S \ni i} \widehat{F}(S)^2$, though we won’t use
this.
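
Definition 11.63 and Remark 11.64 translate directly into a few lines of code. The sketch below stores a multilinear polynomial as a dictionary from index sets to coefficients (a representation we chose for illustration) and uses the Fourier expansion of $\mathrm{Maj}_3$ as a hypothetical example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: Maj_3 = (x1 + x2 + x3)/2 - x1*x2*x3/2, keyed by the set S.
MAJ3 = {
    frozenset({1}): Fraction(1, 2),
    frozenset({2}): Fraction(1, 2),
    frozenset({3}): Fraction(1, 2),
    frozenset({1, 2, 3}): Fraction(-1, 2),
}

def variance(coeffs):
    """Var[F]: sum of squared coefficients over nonempty S."""
    return sum(c * c for S, c in coeffs.items() if S)

def influence(coeffs, i):
    """Inf_i[F]: sum of squared coefficients over sets S containing i."""
    return sum(c * c for S, c in coeffs.items() if i in S)

def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial at x, a dict {index: value}."""
    total = Fraction(0)
    for S, c in coeffs.items():
        term = c
        for i in S:
            term *= x[i]
        total += term
    return total

def second_moment_on_bits(coeffs, n):
    """E[F(z)^2] for uniform +/-1 inputs, by brute-force enumeration."""
    points = list(product((-1, 1), repeat=n))
    values = (evaluate(coeffs, dict(zip(range(1, n + 1), p))) for p in points)
    return sum(v * v for v in values) / len(points)

print(variance(MAJ3), influence(MAJ3, 1), second_moment_on_bits(MAJ3, 3))
```

Here the brute-force second moment agrees with the sum of squared coefficients, as orthonormality of the monomials predicts.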


As in the Berry—Esseen Theorem, to get good error bounds we’ll need our 
random variables z; to be “reasonable”. Sacrificing generality for simplicity 
in this section, we’ll take the bounded 4th-moment notion from Definition 9.1 
which will allow us to use the basic Bonami Lemma (more precisely, Corol- 
lary 9.6): 


Hypothesis 11.65. The random variable $z_i$ satisfies $\mathbf{E}[z_i] = 0$, $\mathbf{E}[z_i^2] = 1$,
$\mathbf{E}[z_i^3] = 0$, and is “9-reasonable” in the sense of Definition 9.1; i.e., $\mathbf{E}[z_i^4] \le 9$.


The main examples we have in mind are that each $z_i$ is either a uniform $\pm 1$
random bit or a standard Gaussian. (There are other possibilities, though; e.g.,
$z_i$ could be uniform on the interval $[-\sqrt{3}, \sqrt{3}]$.)

We can now prove the most basic Invariance Principle, for low-degree 
multilinear polynomials of random variables: 


Basic Invariance Principle. Let $F$ be a formal $n$-variate multilinear polynomial
of degree at most $k \in \mathbb{N}$,

\[
  F(x) = \sum_{S \subseteq [n], |S| \le k} \widehat{F}(S) \prod_{i \in S} x_i.
\]

Let $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ be sequences of independent random
variables, each satisfying Hypothesis 11.65. Assume $\psi : \mathbb{R} \to \mathbb{R}$ is $\mathcal{C}^4$
with $\|\psi''''\|_\infty \le C$. Then

\[
  |\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]| \le \tfrac{C}{12} \cdot 9^k \cdot \sum_{t=1}^n \mathbf{Inf}_t[F]^2. \tag{11.29}
\]


Remark 11.66. The proof will be very similar to the one we used for Berry–Esseen
except that we’ll take a 3rd-order Taylor expansion rather than a
2nd-order one (so that we can use the easy Bonami Lemma). As you are
asked to show in Exercise 11.47, had we only required that $\psi$ be $\mathcal{C}^3$ and that
the $x_i$’s and $y_i$’s be $(2, 3, \rho)$-hypercontractive with 2nd moment equal to 1,
then we could obtain

\[
  |\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]| \le \tfrac{C}{3} \cdot (1/\rho)^{3k} \cdot \sum_{t=1}^n \mathbf{Inf}_t[F]^{3/2}.
\]
t=1 
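Proof. The proof uses the Replacement Method. For $0 \le t \le n$ we define

\[
  H_t = F(y_1, \dots, y_t, x_{t+1}, \dots, x_n),
\]

so $F(x) = H_0$ and $F(y) = H_n$. We will show that

\[
  |\mathbf{E}[\psi(H_{t-1}) - \psi(H_t)]| \le \tfrac{C}{12} \cdot 9^k \cdot \mathbf{Inf}_t[F]^2; \tag{11.30}
\]

as in our proof of the Berry–Esseen Theorem, this will complete the proof
after summing over $t$ and using the triangle inequality. To analyze (11.30)
we separate out the part of $F(x)$ that depends on $x_t$; i.e., we write $F(x) =
\mathrm{E}_t F(x) + x_t \mathrm{D}_t F(x)$, where the formal polynomials $\mathrm{E}_t F$ and $\mathrm{D}_t F$ are defined
by

\[
  \mathrm{E}_t F(x) = \sum_{S \not\ni t} \widehat{F}(S) \prod_{i \in S} x_i, \qquad
  \mathrm{D}_t F(x) = \sum_{S \ni t} \widehat{F}(S) \prod_{i \in S \setminus \{t\}} x_i.
\]

Note that neither $\mathrm{E}_t F$ nor $\mathrm{D}_t F$ depends on the indeterminate $x_t$; thus we can
define

\begin{align*}
  U_t &= \mathrm{E}_t F(y_1, \dots, y_{t-1}, \cdot, x_{t+1}, \dots, x_n), \\
  \Delta_t &= \mathrm{D}_t F(y_1, \dots, y_{t-1}, \cdot, x_{t+1}, \dots, x_n),
\end{align*}

so that

\[
  H_{t-1} = U_t + \Delta_t x_t, \qquad H_t = U_t + \Delta_t y_t.
\]

We now use a 3rd-order Taylor expansion to bound (11.30):

\begin{align*}
  \psi(H_{t-1}) &= \psi(U_t) + \psi'(U_t)\Delta_t x_t + \tfrac12 \psi''(U_t)\Delta_t^2 x_t^2 + \tfrac16 \psi'''(U_t)\Delta_t^3 x_t^3
    + \tfrac{1}{24} \psi''''(U_t^*)\Delta_t^4 x_t^4, \\
  \psi(H_t) &= \psi(U_t) + \psi'(U_t)\Delta_t y_t + \tfrac12 \psi''(U_t)\Delta_t^2 y_t^2 + \tfrac16 \psi'''(U_t)\Delta_t^3 y_t^3
    + \tfrac{1}{24} \psi''''(U_t^{**})\Delta_t^4 y_t^4
\end{align*}

for some random variables $U_t^*$ and $U_t^{**}$. As in the proof of the Berry–Esseen
Theorem, when we subtract these and take the expectation there are
significant simplifications. The 0th-order terms cancel. As for the 1st-order
terms,

\begin{align*}
  \mathbf{E}[\psi'(U_t)\Delta_t x_t - \psi'(U_t)\Delta_t y_t] &= \mathbf{E}[\psi'(U_t)\Delta_t \cdot (x_t - y_t)] \\
  &= \mathbf{E}[\psi'(U_t)\Delta_t] \cdot \mathbf{E}[x_t - y_t] = 0.
\end{align*}

The second equality here crucially uses the fact that $x_t$, $y_t$ are independent
of $U_t$, $\Delta_t$. The final equality only uses the fact that $x_t$ and $y_t$ have matching
1st moments (and not the stronger assumption that both of these 1st moments
are 0). The 2nd- and 3rd-order terms will similarly cancel, using the fact that
$x_t$ and $y_t$ have matching 2nd and 3rd moments. Finally, for the “error” term
we’ll just use $|\psi''''(U_t^*)|, |\psi''''(U_t^{**})| \le C$ and the triangle inequality; we thus
obtain

\[
  |\mathbf{E}[\psi(H_{t-1}) - \psi(H_t)]| \le \tfrac{C}{24} \cdot (\mathbf{E}[(\Delta_t x_t)^4] + \mathbf{E}[(\Delta_t y_t)^4]).
\]

To complete the proof of (11.30) we now just need to bound

\[
  \mathbf{E}[(\Delta_t x_t)^4],\ \mathbf{E}[(\Delta_t y_t)^4] \le 9^k \cdot \mathbf{Inf}_t[F]^2,
\]

which we’ll do using the Bonami Lemma. We’ll give the proof for $\mathbf{E}[(\Delta_t x_t)^4]$,
the case of $\mathbf{E}[(\Delta_t y_t)^4]$ being identical. We have

\[
  \Delta_t x_t = \mathrm{L}_t F(y_1, \dots, y_{t-1}, x_t, x_{t+1}, \dots, x_n),
\]

where

\[
  \mathrm{L}_t F(x) = x_t \mathrm{D}_t F(x) = \sum_{S \ni t} \widehat{F}(S) \prod_{i \in S} x_i.
\]

Since $\mathrm{L}_t F$ has degree at most $k$ we can apply the Bonami Lemma (more
precisely, Corollary 9.6) to obtain

\[
  \mathbf{E}[(\Delta_t x_t)^4] \le 9^k\, \mathbf{E}[\mathrm{L}_t F(y_1, \dots, y_{t-1}, x_t, x_{t+1}, \dots, x_n)^2]^2.
\]

But since $y_1, \dots, y_{t-1}, x_t, \dots, x_n$ are independent with mean 0 and 2nd
moment 1, we have (see Remark 11.64)

\[
  \mathbf{E}[\mathrm{L}_t F(y_1, \dots, y_{t-1}, x_t, x_{t+1}, \dots, x_n)^2]
  = \sum_{S \subseteq [n]} \widehat{\mathrm{L}_t F}(S)^2 = \sum_{S \ni t} \widehat{F}(S)^2 = \mathbf{Inf}_t[F].
\]

Thus we indeed have $\mathbf{E}[(\Delta_t x_t)^4] \le 9^k \cdot \mathbf{Inf}_t[F]^2$, and the proof is complete.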
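
The decomposition $F = \mathrm{E}_t F + x_t \mathrm{D}_t F$ and the fourth-moment bound at the end of the proof can be checked mechanically on a small example. The sketch below is illustrative only: the dictionary representation and the choice of $\mathrm{Maj}_3$ (degree $k = 3$) are ours, and the fourth moment is computed for uniform $\pm 1$ inputs in every coordinate rather than for a genuine hybrid of two input sequences.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example polynomial of degree k = 3: Maj_3 in Fourier form.
F = {
    frozenset({1}): Fraction(1, 2), frozenset({2}): Fraction(1, 2),
    frozenset({3}): Fraction(1, 2), frozenset({1, 2, 3}): Fraction(-1, 2),
}

def split(coeffs, t):
    """Return (E_t F, D_t F): terms without x_t, and terms with x_t factored out."""
    e_part = {S: c for S, c in coeffs.items() if t not in S}
    d_part = {S - {t}: c for S, c in coeffs.items() if t in S}
    return e_part, d_part

def evaluate(coeffs, x):
    """Evaluate the multilinear polynomial at x, a dict {index: value}."""
    total = Fraction(0)
    for S, c in coeffs.items():
        term = c
        for i in S:
            term *= x[i]
        total += term
    return total

def check(coeffs, t, n, k):
    """Verify F = E_t F + x_t D_t F pointwise; return (E[(x_t D_t F)^4], 9^k Inf_t[F]^2)."""
    e_part, d_part = split(coeffs, t)
    inf_t = sum(c * c for c in d_part.values())
    points = list(product((-1, 1), repeat=n))
    fourth = Fraction(0)
    for p in points:
        x = dict(zip(range(1, n + 1), p))
        assert evaluate(coeffs, x) == evaluate(e_part, x) + x[t] * evaluate(d_part, x)
        fourth += (x[t] * evaluate(d_part, x)) ** 4
    return fourth / len(points), 9 ** k * inf_t ** 2

fourth, bound = check(F, 1, 3, 3)
print(fourth, bound)
```

For this example the fourth moment is far below $9^k \cdot \mathbf{Inf}_t[F]^2$, which reflects how much slack the factor $9^k$ typically leaves.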
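Corollary 11.67. In the setting of the preceding theorem, if we furthermore
have $\mathbf{Var}[F] \le 1$ and $\mathbf{Inf}_t[F] \le \epsilon$ for all $t \in [n]$, then

\[
  |\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]| \le \tfrac{C}{12} \cdot k 9^k \cdot \epsilon.
\]


Proof. We have $\sum_t \mathbf{Inf}_t[F]^2 \le \epsilon \sum_t \mathbf{Inf}_t[F] \le \epsilon \sum_S |S| \widehat{F}(S)^2 \le \epsilon k \mathbf{Var}[F]$.


Corollary 11.68. In the setting of the preceding corollary, if we merely have
that $\psi : \mathbb{R} \to \mathbb{R}$ is c-Lipschitz (rather than $\mathcal{C}^4$), then

\[
  |\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]| \le O(c) \cdot 2^k \epsilon^{1/4}.
\]


Proof. Just as in the proof of Corollary 11.59, by using $\widetilde{\psi}_\eta$ from Proposition 11.58
(which has $\|\widetilde{\psi}_\eta''''\|_\infty \le O(c/\eta^3)$) we obtain

\[
  |\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]| \le O(c) \cdot (\eta + k 9^k \epsilon/\eta^3).
\]

The proof is completed by taking $\eta = \sqrt[4]{k 9^k \epsilon} \le 2^k \epsilon^{1/4}$.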


Let’s connect this last corollary back to the study of Boolean functions. Suppose
$f : \{-1,1\}^n \to \mathbb{R}$ has $\epsilon$-small influences (in the sense of Definition 6.9)
and degree at most $k$. Letting $g = (g_1, \dots, g_n)$ be a sequence of independent
standard Gaussians, Corollary 11.68 tells us that for any Lipschitz $\psi$ we have

\[
  \Bigl| \mathop{\mathbf{E}}_{x \sim \{-1,1\}^n}[\psi(f(x))] - \mathop{\mathbf{E}}_{g \sim \mathrm{N}(0,1)^n}[\psi(f(g))] \Bigr| \le O(2^k \epsilon^{1/4}). \tag{11.31}
\]

Here the expression “$f(g)$” is an abuse of notation indicating that the real
numbers $g_1, \dots, g_n$ are substituted into $f$’s Fourier expansion (multilinear
polynomial representation).

At first it may seem peculiar to substitute arbitrary real numbers into the
Fourier expansion of a Boolean function. Actually, if all the numbers being
substituted are in the range $[-1,1]$ then there’s a natural interpretation: as you
were asked to show in Exercise 1.4, if $\mu \in [-1,1]^n$, then $f(\mu) = \mathbf{E}[f(y)]$
where $y \sim \{-1,1\}^n$ is drawn from the product distribution in which $\mathbf{E}[y_i] =
\mu_i$. On the other hand, there doesn’t seem to be any obvious meaning when real
numbers outside the range $[-1,1]$ are substituted into $f$’s Fourier expansion,
as may certainly occur when we consider $f(g)$.

Nevertheless, (11.31) says that when $f$ is a low-degree, small-influence
function, the distribution of the random variable $f(g)$ will be close to that
of $f(x)$. Now suppose $f : \{-1,1\}^n \to \{-1,1\}$ is Boolean-valued and unbiased.
Then (11.31) might seem impossible; how could the continuous random
variable $f(g)$ essentially be $-1$ with probability 1/2 and $+1$ with probability
1/2? The solution to this mystery is that there are no low-degree, small-influence,
unbiased Boolean-valued functions. This is a consequence of the
OSSS Inequality (more precisely, Exercise 8.44(b)), which shows that in this
setting we will always have $\epsilon \ge 1/k^3$ in (11.31), rendering the bound very
weak. If the Aaronson–Ambainis Conjecture holds (see the notes in Chapter 8.7),
a similar statement is true even for functions with range $[-1,1]$.

The reason (11.31) is still useful is that we can apply it to small-influence,
low-degree functions which are almost $\{-1,1\}$-valued, or $[-1,1]$-valued. Such
functions can arise from truncating a very noise-stable Boolean-valued function
to a large but constant degree. For example, we might profitably apply (11.31) to
$f = \mathrm{Maj}_n^{\le k}$ and then deduce some consequences for $\mathrm{Maj}_n(x)$ using the fact that
$\mathbf{E}[(\mathrm{Maj}_n^{\le k}(x) - \mathrm{Maj}_n(x))^2] = \mathbf{W}^{>k}[\mathrm{Maj}_n] \le O(1/\sqrt{k})$ (Corollary 5.23). Let’s
consider this sort of idea more generally:


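Corollary 11.69. Let $f : \{-1,1\}^n \to \mathbb{R}$ have $\mathbf{Var}[f] \le 1$. Let $k \ge 0$ and suppose
$f^{\le k}$ has $\epsilon$-small influences. Then for any c-Lipschitz $\psi : \mathbb{R} \to \mathbb{R}$ we
have

\[
  \Bigl| \mathop{\mathbf{E}}_{x \sim \{-1,1\}^n}[\psi(f(x))] - \mathop{\mathbf{E}}_{g \sim \mathrm{N}(0,1)^n}[\psi(f(g))] \Bigr| \le O(c) \cdot \bigl(2^k \epsilon^{1/4} + \|f^{>k}\|_2\bigr). \tag{11.32}
\]

In particular, suppose $h : \{-1,1\}^n \to \mathbb{R}$ has $\mathbf{Var}[h] \le 1$ and no $(\epsilon, \delta)$-notable
coordinates (we assume $\epsilon \le 1$, $\delta \le \frac{1}{20}$). Then

\[
  \Bigl| \mathop{\mathbf{E}}_{x \sim \{-1,1\}^n}[\psi(\mathrm{T}_{1-\delta} h(x))] - \mathop{\mathbf{E}}_{g \sim \mathrm{N}(0,1)^n}[\psi(\mathrm{T}_{1-\delta} h(g))] \Bigr| \le O(c) \cdot \epsilon^{\delta/3}.
\]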


Proof. For the first statement we simply decompose f = f=* + f?*. Then the 
left-hand side of (11.32) can be written as 


EW (fx) + FE — Elf =*(g) + £7*(g))]| 
< [ELSE ED] — EIEEEI + c EUFO] + c ELSE], 


using the fact that y is c-Lipschitz. The first quantity is at most O(c) - 2*e!/4, 
by Corollary 11.68 (even if k is not an integer). As for the other two quantities, 
Cauchy—Schwarz implies 


EUFO < VELAS |X FSP = IEP I. 
|S|>k 
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and the same bound also holds for E[| FDI; this uses the fact that 
E[f >g] = Visine fisy just as in Remark 11.64. This completes the proof 
of (11.32). 

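As for the second statement of the corollary, let $f = T_{1-\delta}h$. The assumptions
on $h$ imply that $\mathbf{Var}[f] \leq 1$ and that $f^{\leq k}$ has $\epsilon$-small influences for any $k$; the
latter is true because

\[
\mathbf{Inf}_i[f^{\leq k}] = \sum_{|S| \leq k,\ S \ni i} (1-\delta)^{2|S|}\,\hat{h}(S)^2
\leq \sum_{S \ni i} (1-\delta)^{|S|-1}\,\hat{h}(S)^2 = \mathbf{Inf}_i^{(1-\delta)}[h] \leq \epsilon
\]

since $h$ has no $(\epsilon, \delta)$-notable coordinate. Furthermore,

\[
\|f^{>k}\|_2^2 = \sum_{|S| > k} (1-\delta)^{2|S|}\,\hat{h}(S)^2 \leq (1-\delta)^{2k}\,\mathbf{Var}[h] \leq (1-\delta)^{2k} \leq \exp(-2k\delta)
\]

for any $k \geq 1$; i.e., $\|f^{>k}\|_2 \leq \exp(-k\delta)$. So applying the first part of the
corollary gives

\[
\bigl| \mathbf{E}[\psi(f(x))] - \mathbf{E}[\psi(f(g))] \bigr| \leq O(c) \cdot \bigl(2^k \epsilon^{1/4} + \exp(-k\delta)\bigr) \tag{11.33}
\]

for any $k \geq 0$. Choosing $k = \tfrac13 \ln(1/\epsilon)$, the right-hand side of (11.33) becomes

\[
O(c) \cdot \bigl(\epsilon^{1/4 - (1/3)\ln 2} + \epsilon^{\delta/3}\bigr) \leq O(c) \cdot \epsilon^{\delta/3},
\]

where the inequality uses the assumption $\delta \leq \tfrac{1}{20}$ (numerically, $\tfrac14 - \tfrac13 \ln 2 \approx \tfrac{1}{53}$).
This completes the proof of the second statement of the corollary.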
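The arithmetic behind the choice $k = \tfrac13 \ln(1/\epsilon)$ can be checked numerically. The following is only an illustrative sketch: the function name `corollary_error_terms` and the sample values of `eps` and `delta` are assumptions made for the example, not part of the text.

```python
import math

def corollary_error_terms(eps: float, delta: float):
    """Return (2^k * eps^(1/4), exp(-k * delta)) for k = (1/3) ln(1/eps)."""
    k = math.log(1.0 / eps) / 3.0
    truncation_term = (2.0 ** k) * eps ** 0.25   # equals eps^(1/4 - ln(2)/3)
    tail_term = math.exp(-k * delta)             # equals eps^(delta/3)
    return truncation_term, tail_term

# Exponent left over on the first term after substituting k.
leftover_exponent = 0.25 - math.log(2.0) / 3.0   # roughly 1/53

# Hypothetical sample values satisfying eps <= 1 and delta <= 1/20.
eps, delta = 1e-6, 1.0 / 20.0
truncation_term, tail_term = corollary_error_terms(eps, delta)
```

Since $\delta/3 \leq 1/60$ is smaller than the leftover exponent, the first term is dominated by $\epsilon^{\delta/3}$, which is what the final inequality in the proof uses.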


Finally, if we think of the Basic Invariance Principle as the nonlinear ana- 
logue of our Variant Berry—Esseen Theorem, it’s natural to ask for the nonlinear 
analogue of the Berry—Esseen Theorem itself, i.e., a statement showing cdf- 
closeness of F(x) and F(g). It’s straightforward to obtain a Lévy distance 
bound just as in the degree-1 case, Corollary 11.61; Exercise 11.44 asks you to 
show the following: 


Corollary 11.70. In the setting of Corollary 11.67 we have the Lévy distance
bound $d_L(F(x), F(y)) \leq O(2^k \epsilon^{1/5})$. In the setting of Remark 11.66 we have
the bound $d_L(F(x), F(y)) \leq (1/\rho)^{O(k)} \epsilon^{1/8}$.


Suppose we now want actual cdf-closeness in the case that $y \sim N(0,1)^n$. In
the degree-1 (Berry–Esseen) case we used the fact that degree-1 polynomials of
independent Gaussians have good anticoncentration. The analogous statement
for higher-degree polynomials of Gaussians is not so easy to prove; however,
Carbery and Wright (Carbery and Wright, 2001, Theorem 8) have obtained the
following essentially optimal result:

Carbery–Wright Theorem. Let $p : \mathbb{R}^n \to \mathbb{R}$ be a polynomial (not necessarily
multilinear) of degree at most $k$, let $g \sim N(0,1)^n$, and assume $\mathbf{E}[p(g)^2] = 1$.
Then for all $\epsilon > 0$,

\[
\Pr[|p(g)| \leq \epsilon] \leq O(k\,\epsilon^{1/k}),
\]

where the $O(\cdot)$ hides a universal constant.

Using this theorem it’s not hard (see Exercise 11.45) to obtain:

Theorem 11.71. Let $f : \{-1,1\}^n \to \mathbb{R}$ be of degree at most $k$, with $\epsilon$-small
influences and $\mathbf{Var}[f] = 1$. Then for all $u \in \mathbb{R}$,

\[
\bigl| \Pr[f(x) \leq u] - \Pr[f(g) \leq u] \bigr| \leq O(k) \cdot \epsilon^{1/(4k+1)},
\]

where the $O(\cdot)$ hides a universal constant.


11.7. Highlight: Majority Is Stablest Theorem 


The Majority Is Stablest Theorem (to be proved at the end of this section)
was originally conjectured in 2004 (Khot et al., 2004, 2007). The motivation
came from studying the approximability of the Max-Cut CSP. Recall that
Max-Cut is perhaps the simplest possible constraint satisfaction problem: the
domain of the variables is $\Omega = \{-1,1\}$ and the only constraint allowed is the
binary non-equality predicate, $\neq : \{-1,1\}^2 \to \{0,1\}$. As we mentioned briefly
in Section 7.3, Goemans and Williamson (Goemans and Williamson, 1995)
gave a very sophisticated efficient algorithm using “semidefinite programming”
which $(c_{GW}\beta, \beta)$-approximates Max-Cut for every $\beta$, where $c_{GW} \approx .8786$ is a
certain trigonometric constant.

Turning to hardness of approximation, we know from Theorem 7.40 (developed
in (Khot et al., 2004)) that to prove UG-hardness of $(\alpha + \delta, \beta - \delta)$-approximating
Max-Cut, it suffices to construct an $(\alpha, \beta)$-Dictator-vs.-No-Notables
test which uses the predicate $\neq$. As we’ll see in this section, the
quality of the most natural such test can be easily inferred from the Majority
Is Stablest Theorem. Assuming that theorem (as Khot et al. (Khot et al., 2004)
did), we get a surprising conclusion: It’s UG-hard to approximate the Max-Cut
CSP any better than the Goemans–Williamson Algorithm does. In other
words, the peculiar approximation guarantee of Goemans and Williamson on
the very simple Max-Cut problem is optimal (assuming the Unique Games
Conjecture).

Let’s demystify this somewhat, starting with a description of the
Goemans–Williamson Algorithm. Let $G = (V, E)$ be an $n$-vertex input graph for the
algorithm; we’ll write $(v, w) \sim E$ to denote that $(v, w)$ is a uniformly random
edge (i.e., $\neq$-constraint) in the graph. The first step of the Goemans–Williamson
Algorithm is to solve the following optimization problem:

\[
\begin{aligned}
\text{maximize} \quad & \mathbf{E}_{(v,w) \sim E}\bigl[\tfrac12 - \tfrac12 \langle \vec{U}(v), \vec{U}(w) \rangle\bigr] \\
\text{subject to} \quad & \vec{U} : V \to S^{n-1}.
\end{aligned}
\tag{SDP}
\]

Here $S^{n-1}$ denotes the set of all unit vectors in $\mathbb{R}^n$. Somewhat surprisingly,
since this optimization problem is a “semidefinite program” it can be solved
in polynomial time using the Ellipsoid Algorithm. (Technically, it can only be
solved up to any desired additive tolerance $\epsilon > 0$, but we’ll ignore this point.)
Let’s write $\mathrm{SDPOpt}(G)$ for the optimum value of (SDP), and $\mathrm{Opt}(G)$ for the
optimum Max-Cut value for $G$. We claim that (SDP) is a relaxation of the
Max-Cut CSP on input $G$, and therefore

\[
\mathrm{SDPOpt}(G) \geq \mathrm{Opt}(G).
\]

To see this, simply note that if $F^* : V \to \{-1,1\}$ is an optimal assignment
(“cut”) for $G$ then we can define $\vec{U}(v) = (F^*(v), 0, \dots, 0) \in S^{n-1}$ for each
$v \in V$ and achieve the optimal cut value $\mathrm{Val}_G(F^*)$ in (SDP).

The second step of the Goemans–Williamson Algorithm might look familiar
from Fact 11.7 and Remark 11.8. Let $\vec{U}^* : V \to S^{n-1}$ be the optimal solution
for (SDP), achieving $\mathrm{SDPOpt}(G)$; abusing notation we’ll write $\vec{U}^*(v) = \vec{v}$.
The algorithm now chooses $\vec{g} \sim N(0,1)^n$ at random and outputs the assignment
(cut) $F : V \to \{-1,1\}$ defined by $F(v) = \mathrm{sgn}(\langle \vec{v}, \vec{g} \rangle)$. Let’s analyze the
(expected) quality of this assignment. The probability the algorithm’s assignment
$F$ cuts a particular edge $(v, w) \in E$ is

\[
\Pr_{\vec{g}}\bigl[\mathrm{sgn}(\langle \vec{v}, \vec{g} \rangle) \neq \mathrm{sgn}(\langle \vec{w}, \vec{g} \rangle)\bigr].
\]

This is precisely the probability that $\mathrm{sgn}(z) \neq \mathrm{sgn}(z')$ when $(z, z')$ is a pair of
$\langle \vec{v}, \vec{w} \rangle$-correlated 1-dimensional Gaussians. Writing $\angle(\vec{v}, \vec{w}) \in [0, \pi]$ for the
angle between the unit vectors $\vec{v}$, $\vec{w}$, we conclude from Sheppard’s Formula
(see (11.2)) that

\[
\Pr_{\vec{g}}[F \text{ cuts edge } (v, w)] = \frac{\angle(\vec{v}, \vec{w})}{\pi}.
\]

By linearity of expectation we can compute the expected value of the
algorithm’s assignment $F$:

\[
\mathbf{E}_{\vec{g}}[\mathrm{Val}_G(F)] = \mathbf{E}_{(v,w) \sim E}\bigl[\angle(\vec{v}, \vec{w})/\pi\bigr]. \tag{11.34}
\]

On the other hand, by definition we have

\[
\mathrm{SDPOpt}(G) = \mathbf{E}_{(v,w) \sim E}\bigl[\tfrac12 - \tfrac12 \cos \angle(\vec{v}, \vec{w})\bigr]. \tag{11.35}
\]

It remains to compare (11.34) and (11.35). Define

\[
c_{GW} = \min_{\theta \in (0, \pi]} \left\{ \frac{\theta/\pi}{\tfrac12 - \tfrac12 \cos\theta} \right\} \approx .8786. \tag{11.36}
\]

Then from (11.34) and (11.35) we immediately get

\[
\mathbf{E}_{\vec{g}}[\mathrm{Val}_G(F)] \geq c_{GW} \cdot \mathrm{SDPOpt}(G) \geq c_{GW} \cdot \mathrm{Opt}(G);
\]

i.e., in expectation the Goemans–Williamson Algorithm delivers a cut of value
at least $c_{GW}$ times the Max-Cut. In other words, it’s a $(c_{GW}\beta, \beta)$-approximation
algorithm, as claimed. By being a little bit more careful about this analysis
(Exercise 11.33) you can show the following additional result:


Theorem 11.72. (Goemans and Williamson, 1995). Let $\theta \in [\theta^*, \pi]$, where
$\theta^* \approx .74\pi$ is the minimizing $\theta$ in (11.36) (also definable as the positive
solution of $\tan(\theta/2) = \theta$). Then on any graph $G$ with $\mathrm{SDPOpt}(G) \geq \tfrac12 - \tfrac12 \cos\theta$,
the Goemans–Williamson Algorithm produces a cut of (expected) value at
least $\theta/\pi$. In particular, the algorithm is a $(\theta/\pi, \tfrac12 - \tfrac12 \cos\theta)$-approximation
algorithm for Max-Cut.
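The constant in (11.36) and the angle $\theta^*$ can be recovered numerically. The sketch below is illustrative only: the helper names `gw_ratio` and `minimize_on_interval`, the golden-section search, and the search interval $[1, \pi]$ (on which the ratio is assumed to be unimodal) are choices made for this example rather than anything prescribed by the text.

```python
import math

def gw_ratio(theta: float) -> float:
    """The ratio (theta/pi) / (1/2 - 1/2 cos(theta)) minimized in (11.36)."""
    return (theta / math.pi) / (0.5 - 0.5 * math.cos(theta))

def minimize_on_interval(func, lo: float, hi: float, iterations: int = 200) -> float:
    """Golden-section search; assumes func is unimodal on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iterations):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d   # the minimizer lies in [a, d]
        else:
            a = c   # the minimizer lies in [c, b]
    return (a + b) / 2.0

theta_star = minimize_on_interval(gw_ratio, 1.0, math.pi)
c_gw = gw_ratio(theta_star)
```

Under these assumptions `c_gw` comes out near .8786 and `theta_star / math.pi` near .74, and `theta_star` approximately satisfies $\tan(\theta/2) = \theta$, matching the description in Theorem 11.72.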


Example 11.73. Consider the Max-Cut problem on the 5-vertex cycle graph $\mathbb{Z}_5$.
The best bipartition of this graph cuts 4 out of the 5 edges; hence $\mathrm{Opt}(\mathbb{Z}_5) = \tfrac45$.
Exercise 11.32 asks you to show that taking

\[
\vec{U}(v) = \bigl(\cos\tfrac{4\pi v}{5}, \sin\tfrac{4\pi v}{5}\bigr), \quad v \in \mathbb{Z}_5,
\]

in the semidefinite program (SDP) establishes that $\mathrm{SDPOpt}(\mathbb{Z}_5) \geq \tfrac12 - \tfrac12 \cos\tfrac{4\pi}{5}$.
(These are actually unit vectors in $\mathbb{R}^2$ rather than in $\mathbb{R}^5$ as (SDP)
requires, but we can pad out the last three coordinates with zeroes.) This example
shows that the Goemans–Williamson analysis in Theorem 11.72 lower-bounding
$\mathrm{Opt}(G)$ in terms of $\mathrm{SDPOpt}(G)$ cannot be improved (at least when
$\mathrm{SDPOpt}(G) = \tfrac58 + \tfrac{\sqrt5}{8}$). This is termed an optimal integrality gap. In fact,
Theorem 11.72 also implies that $\mathrm{SDPOpt}(\mathbb{Z}_5)$ must equal $\tfrac12 - \tfrac12 \cos\tfrac{4\pi}{5}$, for if it
were greater, the theorem would falsely imply that $\mathrm{Opt}(\mathbb{Z}_5) > \tfrac45$. Note that the
Goemans–Williamson Algorithm actually finds the maximum cut when run on
the cycle graph $\mathbb{Z}_5$. For a related example, see Exercise 11.35.
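The numbers in this example are small enough to verify directly. The following sketch evaluates the (SDP) objective and the expected rounded cut value (11.34) for the pentagon vectors, and brute-forces the true optimum; the names `pentagon_vectors`, `inner`, and `angle` are illustrative helpers, not notation from the text.

```python
import math
from itertools import product

def pentagon_vectors():
    """Unit vectors U(v) = (cos(4*pi*v/5), sin(4*pi*v/5)) for v in Z_5."""
    return [(math.cos(4 * math.pi * v / 5), math.sin(4 * math.pi * v / 5))
            for v in range(5)]

def inner(u, w):
    return u[0] * w[0] + u[1] * w[1]

def angle(u, w):
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, inner(u, w))))

edges = [(v, (v + 1) % 5) for v in range(5)]
vectors = pentagon_vectors()

# Objective of (SDP): average over edges of 1/2 - 1/2 <U(v), U(w)>.
sdp_value = sum(0.5 - 0.5 * inner(vectors[v], vectors[w]) for v, w in edges) / len(edges)

# Expected value of the rounded cut, by (11.34): average of angle / pi.
expected_cut = sum(angle(vectors[v], vectors[w]) / math.pi for v, w in edges) / len(edges)

# True optimum by brute force over all 2^5 assignments.
opt = max(sum(1 for v, w in edges if signs[v] != signs[w])
          for signs in product((-1, 1), repeat=5)) / len(edges)
```

Every edge of the pentagon sees the same angle $4\pi/5$ between its endpoint vectors, so the expected cut value $\tfrac45$ coincides with the brute-force optimum, while the SDP value is strictly larger.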




Now we explain the result of Khot et al. (Khot et al., 2004), that the Majority
Is Stablest Theorem implies it’s UG-hard to approximate Max-Cut better than
the Goemans–Williamson Algorithm does:

Theorem 11.74. (Khot et al., 2004). Let $\theta \in (\tfrac{\pi}{2}, \pi)$. Then for any $\delta > 0$ it’s
UG-hard to $(\theta/\pi + \delta, \tfrac12 - \tfrac12 \cos\theta)$-approximate Max-Cut.


Proof. It follows from Theorem 7.40 that we just need to construct a
$(\theta/\pi, \tfrac12 - \tfrac12 \cos\theta)$-Dictator-vs.-No-Notables test using the predicate $\neq$. (See
Exercise 11.36 for an extremely minor technical point.) It’s very natural to try
the following, with $\beta = \tfrac12 - \tfrac12 \cos\theta \in (\tfrac12, 1)$:

$\beta$-Noise Sensitivity Test. Given query access to $f : \{-1,1\}^n \to \{-1,1\}$:

• Choose $x \sim \{-1,1\}^n$ and form $x'$ by reversing each bit of $x$ independently
  with probability $\beta = \tfrac12 - \tfrac12 \cos\theta$. In other words let $(x, x')$ be a pair of
  $\cos\theta$-correlated strings. (Note that $\cos\theta < 0$.)
• Query $f$ at $x$, $x'$.
• Accept if $f(x) \neq f(x')$.

By design,

\[
\Pr[\text{the test accepts } f] = \mathbf{NS}_\beta[f] = \tfrac12 - \tfrac12 \mathbf{Stab}_{\cos\theta}[f]. \tag{11.37}
\]

(We might also express this as “$\mathrm{RS}_f(\theta)$”.) In particular, if $f$ is a dictator,
it’s accepted with probability exactly $\beta = \tfrac12 - \tfrac12 \cos\theta$. To complete the proof
that this is a $(\theta/\pi, \tfrac12 - \tfrac12 \cos\theta)$-Dictator-vs.-No-Notables test, let’s suppose
$f : \{-1,1\}^n \to [-1,1]$ has no $(\epsilon, \epsilon)$-notable coordinates and show that (11.37)
is at most $\theta/\pi + o_\epsilon(1)$. (Regarding $f$ having range $[-1,1]$, recall
Remark 7.38.)

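At first it might look like we can immediately apply the Majority Is Stablest
Theorem; however, the theorem’s inequality goes the “wrong way” and the
correlation parameter $\rho = \cos\theta$ is negative. These two difficulties actually
cancel each other out. Note that

\[
\begin{aligned}
\Pr[\text{the test accepts } f] &= \tfrac12 - \tfrac12 \mathbf{Stab}_{\cos\theta}[f] \\
&= \tfrac12 - \tfrac12 \sum_{k=0}^{n} (\cos\theta)^k\,\mathbf{W}^k[f] \\
&\leq \tfrac12 + \tfrac12 \sum_{k \text{ odd}} (-\cos\theta)^k\,\mathbf{W}^k[f] \qquad \text{(since } \cos\theta < 0\text{)} \\
&= \tfrac12 + \tfrac12 \mathbf{Stab}_{-\cos\theta}[f^{\mathrm{odd}}],
\end{aligned}
\tag{11.38}
\]

where $f^{\mathrm{odd}} : \{-1,1\}^n \to [-1,1]$ is the odd part of $f$ (see Exercise 1.8) defined
by

\[
f^{\mathrm{odd}}(x) = \tfrac12 (f(x) - f(-x)) = \sum_{|S| \text{ odd}} \hat{f}(S)\,x^S.
\]

Now we’re really in a position to apply the Majority Is Stablest Theorem
to $f^{\mathrm{odd}}$, because $-\cos\theta \in (0,1)$, $\mathbf{E}[f^{\mathrm{odd}}] = 0$, and $f^{\mathrm{odd}}$ has no $(\epsilon, \epsilon)$-notable
coordinates (since it’s formed from $f$ by just dropping some terms in the
Fourier expansion). Using $-\cos\theta = \cos(\pi - \theta)$, the result is that

\[
\mathbf{Stab}_{-\cos\theta}[f^{\mathrm{odd}}] \leq 1 - \tfrac{2}{\pi} \arccos(\cos(\pi - \theta)) + o_\epsilon(1) = 2\theta/\pi - 1 + o_\epsilon(1).
\]

Putting this into (11.38) yields

\[
\Pr[\text{the test accepts } f] \leq \tfrac12 + \tfrac12 \bigl(2\theta/\pi - 1 + o_\epsilon(1)\bigr) = \theta/\pi + o_\epsilon(1),
\]

as needed.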
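Identity (11.37) can be checked exactly on tiny examples by enumerating all inputs and all flip patterns. This is a sketch for illustration only: the function `acceptance_probability`, the choice $\theta = \tfrac34\pi$, and the use of the 3-bit majority (whose noise stability is the standard $\tfrac34\rho + \tfrac14\rho^3$) are assumptions of the example. Note that 3-bit majority has highly influential coordinates, so the $\theta/\pi$ bound from the proof is not expected to apply to it.

```python
import math
from itertools import product

def acceptance_probability(f, n: int, beta: float) -> float:
    """Exact Pr[f(x) != f(x')] when each bit of x is flipped with probability beta."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        for flips in product((False, True), repeat=n):
            x_prime = tuple(-xi if flip else xi for xi, flip in zip(x, flips))
            weight = 1.0
            for flip in flips:
                weight *= beta if flip else (1.0 - beta)
            if f(x) != f(x_prime):
                total += weight
    return total / 2 ** n

def dictator(x):
    return x[0]

def majority(x):
    return 1 if sum(x) > 0 else -1

theta = 0.75 * math.pi                 # hypothetical angle in (pi/2, pi)
beta = 0.5 - 0.5 * math.cos(theta)
rho = math.cos(theta)

dictator_accept = acceptance_probability(dictator, 3, beta)
majority_accept = acceptance_probability(majority, 3, beta)

# Right-hand side of (11.37) for Maj_3, using Stab_rho[Maj_3] = (3/4) rho + (1/4) rho^3.
majority_formula = 0.5 - 0.5 * (0.75 * rho + 0.25 * rho ** 3)
```

The dictator is accepted with probability exactly `beta`, as stated in the proof, and the enumerated acceptance probability for the majority matches the Fourier-side expression.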


Remark 11.75. There’s actually still a mismatch between the algorithmic
guarantee of Theorem 11.72 and the UG-hardness result Theorem 11.74, concerning
the case of $\theta \in (\tfrac{\pi}{2}, \theta^*)$. In fact, for these values of $\theta$ (i.e., $\tfrac12 < \beta \lesssim .8446$)
neither result is sharp; see O’Donnell and Wu (O’Donnell and Wu, 2008).


Remark 11.76. If we want to prove UG-hardness of $(\theta'/\pi + \delta, \tfrac12 - \tfrac12 \cos\theta')$-approximating
Max-Cut, we don’t need the full version of Borell’s Isoperimetric
Theorem; we only need the volume-$\tfrac12$ case with parameter $\theta = \pi - \theta'$.
Corollary 11.44 gave a simple proof of this result for $\theta = \tfrac{\pi}{4}$, hence $\theta' = \tfrac34\pi$.
This yields UG-hardness of $(\tfrac34 + \delta, \tfrac12 + \tfrac{1}{2\sqrt2})$-approximating Max-Cut. The
ratio between $\alpha$ and $\beta$ here is approximately .8787, very close to the
Goemans–Williamson constant $c_{GW} \approx .8786$.
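The ratio quoted in the remark is a one-line computation for the pair obtained at $\theta' = \tfrac34\pi$. The variable names below are illustrative only.

```python
import math

theta_prime = 0.75 * math.pi
alpha = theta_prime / math.pi                 # 3/4
beta = 0.5 - 0.5 * math.cos(theta_prime)      # 1/2 + 1/(2*sqrt(2))
ratio = alpha / beta                          # approximately .8787
```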


Finally, we will prove the General-Volume Majority Is Stablest Theorem, by
using the Invariance Principle to reduce it to Borell’s Isoperimetric Theorem.

General-Volume Majority Is Stablest Theorem. Let $f : \{-1,1\}^n \to [0,1]$.
Suppose that $\mathbf{MaxInf}[f] \leq \epsilon$, or more generally, that $f$ has no
$(\epsilon, \tfrac{1}{\log(1/\epsilon)})$-notable coordinates. Then for any $0 \leq \rho < 1$,

\[
\mathbf{Stab}_\rho[f] \leq \Lambda_\rho(\mathbf{E}[f]) + O\!\left(\frac{\log\log(1/\epsilon)}{\log(1/\epsilon)}\right) \cdot \frac{1}{1-\rho}. \tag{11.39}
\]

(Here the $O(\cdot)$ bound has no dependence on $\rho$.)


Proof. The proof involves using the Basic Invariance Principle twice (in the
form of Corollary 11.69). To facilitate this we introduce $f' = T_{1-\delta} f$, where
(with foresight) we choose

\[
\delta = 3\,\frac{\log\log(1/\epsilon)}{\log(1/\epsilon)}.
\]

(We may assume $\epsilon$ is sufficiently small so that $0 < \delta \leq \tfrac{1}{20}$.) Note that $\mathbf{E}[f'] =
\mathbf{E}[f]$ and that

\[
\mathbf{Stab}_\rho[f'] = \sum_{S \subseteq [n]} \rho^{|S|} (1-\delta)^{2|S|} \hat{f}(S)^2 = \mathbf{Stab}_{\rho(1-\delta)^2}[f].
\]

But

\[
\bigl| \mathbf{Stab}_{\rho(1-\delta)^2}[f] - \mathbf{Stab}_\rho[f] \bigr| \leq (\rho - \rho(1-\delta)^2) \cdot \frac{1}{1-\rho} \cdot \mathbf{Var}[f] \leq 2\delta \cdot \frac{1}{1-\rho} \tag{11.40}
\]

by Exercise 2.46, and with our choice of $\delta$ this can be absorbed into the error
of (11.39). Thus it suffices to prove (11.39) with $f'$ in place of $f$.
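The point of this particular $\delta$ is that $\epsilon^{\delta/3}$, the error level delivered by Corollary 11.69, equals exactly $1/\log(1/\epsilon)$. A quick numeric sketch, assuming natural logarithms and working with $L = \log(1/\epsilon)$ directly to avoid floating-point underflow (the name `choose_delta` and the sample value of `L` are hypothetical):

```python
import math

def choose_delta(log_inv_eps: float) -> float:
    """delta = 3 * log(log(1/eps)) / log(1/eps), given L = log(1/eps)."""
    return 3.0 * math.log(log_inv_eps) / log_inv_eps

L = 500.0                               # hypothetical: eps = exp(-500)
delta = choose_delta(L)

# eps^(delta/3) = exp(-L * delta / 3), which should equal 1 / L.
invariance_error = math.exp(-L * delta / 3.0)
stability_error_coefficient = 2.0 * delta   # the "2 delta" appearing in (11.40)
```

Both error sources are then of order $\log\log(1/\epsilon)/\log(1/\epsilon)$ or smaller, which is the error term in (11.39).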

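Let $\mathrm{Sq} : \mathbb{R} \to \mathbb{R}$ be the continuous function which agrees with $t \mapsto t^2$ for
$t \in [0,1]$ and is constant outside $[0,1]$. Note that $\mathrm{Sq}$ is 2-Lipschitz. We will
apply the second part of Corollary 11.69 with “$h$” set to $T_{\sqrt\rho} f$ (and thus
$T_{1-\delta} h = T_{\sqrt\rho} f'$). This is valid since the variance and $(1-\delta)$-stable influences
of $h$ are only smaller than those of $f$. Thus

\[
\Bigl| \mathbf{E}_{x \sim \{-1,1\}^n}[\mathrm{Sq}(T_{\sqrt\rho} f'(x))] - \mathbf{E}_{g \sim N(0,1)^n}[\mathrm{Sq}(T_{\sqrt\rho} f'(g))] \Bigr|
\leq O(\epsilon^{\delta/3}) = O\!\left(\frac{1}{\log(1/\epsilon)}\right), \tag{11.41}
\]

using our choice of $\delta$. (In fact, it’s trading off this error with (11.40) that led to
our choice of $\delta$.) Now $T_{\sqrt\rho} f'(x) = T_{(1-\delta)\sqrt\rho} f(x)$ is always bounded in $[0,1]$,
so

\[
\mathbf{E}_{x}[\mathrm{Sq}(T_{\sqrt\rho} f'(x))] = \mathbf{E}_{x}[(T_{\sqrt\rho} f'(x))^2] = \mathbf{Stab}_\rho[f'].
\]

Furthermore, $T_{\sqrt\rho} f'(g)$ is the same as $U_{\sqrt\rho} f'(g)$ because $f'$ is a multilinear
polynomial. (Both are equal to $f'(\sqrt\rho\,g)$; see Fact 11.13.) Thus in light of (11.41),
to complete the proof of (11.39) it suffices to show

\[
\mathbf{E}_{g \sim N(0,1)^n}[\mathrm{Sq}(U_{\sqrt\rho} f'(g))] - \Lambda_\rho(\mathbf{E}[f']) \leq O\!\left(\frac{1}{\log(1/\epsilon)}\right). \tag{11.42}
\]

Define the function $F : \mathbb{R}^n \to [0,1]$ by

\[
F(g) = \mathrm{trunc}_{[0,1]}(f'(g)) =
\begin{cases}
0 & \text{if } f'(g) < 0, \\
f'(g) & \text{if } f'(g) \in [0,1], \\
1 & \text{if } f'(g) > 1.
\end{cases}
\]

We will establish the following two inequalities, which together imply (11.42):

\[
\Bigl| \mathbf{E}_{g}[\mathrm{Sq}(U_{\sqrt\rho} f'(g))] - \mathbf{E}_{g}[\mathrm{Sq}(U_{\sqrt\rho} F(g))] \Bigr| \leq O\!\left(\frac{1}{\log(1/\epsilon)}\right), \tag{11.43}
\]

\[
\mathbf{E}_{g}[\mathrm{Sq}(U_{\sqrt\rho} F(g))] \leq \Lambda_\rho(\mathbf{E}[f']) + O\!\left(\frac{1}{\log(1/\epsilon)}\right). \tag{11.44}
\]

Both of these inequalities will in turn follow from

\[
\mathbf{E}_{g \sim N(0,1)^n}[|f'(g) - F(g)|] \leq O\!\left(\frac{1}{\log(1/\epsilon)}\right). \tag{11.45}
\]

Let’s show how (11.43) and (11.44) follow from (11.45), leaving the proof
of (11.45) to the end. For (11.43),

\[
\begin{aligned}
\bigl| \mathbf{E}[\mathrm{Sq}(U_{\sqrt\rho} f'(g))] - \mathbf{E}[\mathrm{Sq}(U_{\sqrt\rho} F(g))] \bigr|
&\leq 2\,\mathbf{E}[|U_{\sqrt\rho} f'(g) - U_{\sqrt\rho} F(g)|] \\
&\leq 2\,\mathbf{E}[|f'(g) - F(g)|] \leq O\!\left(\frac{1}{\log(1/\epsilon)}\right),
\end{aligned}
\]

where the first inequality used that $\mathrm{Sq}$ is 2-Lipschitz, the second inequality
used the fact that $U_{\sqrt\rho}$ is a contraction on $L^1(\mathbb{R}^n, \gamma)$, and the third inequality
was (11.45). As for (11.44), $U_{\sqrt\rho} F$ is bounded in $[0,1]$ since $F$ is. Thus

\[
\mathbf{E}[\mathrm{Sq}(U_{\sqrt\rho} F(g))] = \mathbf{E}[(U_{\sqrt\rho} F(g))^2] = \mathbf{Stab}_\rho[F] \leq \Lambda_\rho(\mathbf{E}[F(g)]),
\]

where we used Borell’s Isoperimetric Theorem. But $|\mathbf{E}[F(g)] - \mathbf{E}[f'(g)]| \leq
O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr)$ by (11.45), and $\Lambda_\rho$ is easily shown to be 2-Lipschitz
(Exercise 11.19(e)). This establishes (11.44).

It therefore remains to show (11.45), which we do by applying the Invariance
Principle one more time. Taking $\psi$ to be the 1-Lipschitz function $\mathrm{dist}_{[0,1]}$ in
Corollary 11.69 we deduce

\[
\Bigl| \mathbf{E}_{x \sim \{-1,1\}^n}[\mathrm{dist}_{[0,1]}(f'(x))] - \mathbf{E}_{g \sim N(0,1)^n}[\mathrm{dist}_{[0,1]}(f'(g))] \Bigr|
\leq O(\epsilon^{\delta/3}) = O\!\left(\frac{1}{\log(1/\epsilon)}\right).
\]

But $\mathbf{E}[\mathrm{dist}_{[0,1]}(f'(x))] = 0$ since $f'(x) = T_{1-\delta} f(x) \in [0,1]$ always. Since
$\mathrm{dist}_{[0,1]}(f'(g)) = |f'(g) - F(g)|$, this establishes (11.45) and completes the proof.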


We conclude with one more application of the Majority Is Stablest Theorem.
Recall Kalai’s version of Arrow’s Theorem from Chapter 2.5, i.e., Theorem 2.56.
It states that in a 3-candidate Condorcet election using the voting rule
$f : \{-1,1\}^n \to \{-1,1\}$, the probability of having a Condorcet winner (often
called a rational outcome) is precisely $\tfrac34 - \tfrac34 \mathbf{Stab}_{-1/3}[f]$. As we saw in the
proof of Theorem 11.74 near (11.38), this is in turn at most $\tfrac34 + \tfrac34 \mathbf{Stab}_{1/3}[f^{\mathrm{odd}}]$,
with equality if $f$ is already odd. It follows from the Majority Is Stablest Theorem
that among all voting rules with $\epsilon$-small influences (a condition all reasonable
voting rules should satisfy), majority rule is the “most rational”. Thus we
see that the principle of representative democracy can be derived using analysis
of Boolean functions.
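The formula from Theorem 2.56 can be confirmed by brute force for three voters using majority rule. The sketch below enumerates all $6^3$ preference profiles with exact rational arithmetic; the helper `has_condorcet_winner` and the candidate labels are illustrative, and the expression $\mathbf{Stab}_\rho[\mathrm{Maj}_3] = \tfrac34\rho + \tfrac14\rho^3$ is the standard Fourier computation for 3-bit majority rather than something stated in this section.

```python
from fractions import Fraction
from itertools import permutations, product

def majority(bits):
    return 1 if sum(bits) > 0 else -1

def has_condorcet_winner(profile, rule):
    """profile: tuple of rankings, each a permutation of 'abc' (best first)."""
    def beats(p, q):
        # Each voter contributes +1 if they rank p above q, else -1.
        votes = tuple(1 if r.index(p) < r.index(q) else -1 for r in profile)
        return rule(votes) == 1
    return any(all(beats(p, q) for q in "abc" if q != p) for p in "abc")

rankings = list(permutations("abc"))
n = 3
winners = sum(has_condorcet_winner(profile, majority)
              for profile in product(rankings, repeat=n))
probability = Fraction(winners, len(rankings) ** n)

# 3/4 - 3/4 * Stab_{-1/3}[Maj_3], with Stab_rho[Maj_3] = (3/4) rho + (1/4) rho^3.
rho = Fraction(-1, 3)
formula = Fraction(3, 4) - Fraction(3, 4) * (Fraction(3, 4) * rho + Fraction(1, 4) * rho ** 3)
```

Both quantities equal $\tfrac{17}{18}$ for this small case.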
11.8. Exercises and Notes

11.1 Let $\mathcal{A}$ be the set of all functions $f : \mathbb{R}^n \to \mathbb{R}$ which are finite linear
combinations of indicator functions of boxes. Prove that $\mathcal{A}$ is dense in
$L^1(\mathbb{R}^n, \gamma)$.

11.2 Fill in proof details for the Gaussian Hypercontractivity Theorem. 

11.3 Prove Fact 11.13. (Cf. Exercise 2.25.) 

11.4 Show that $U_{\rho_1} U_{\rho_2} = U_{\rho_1 \rho_2}$ for all $\rho_1, \rho_2 \in [-1,1]$. (Cf. Exercise 2.32.)

11.5 Prove Proposition 11.16. (Hint: For $\rho \neq 0$, write $g(z) = U_\rho f(z)$ and
show that $g(z/\rho)$ is a smooth function using the relationship between
convolution and derivatives.)

11.6 (a) Prove Proposition 11.17. (Hint: First prove it for bounded continuous
$f$; then make an approximation and use Proposition 11.15.)
(b) Deduce more generally that for $f \in L^1(\mathbb{R}^n, \gamma)$ the map $\rho \mapsto U_\rho f$ is
“strongly continuous” on $[0,1]$, meaning that for any $\rho \in [0,1]$ we
have $\|U_{\rho'} f - U_\rho f\|_1 \to 0$ as $\rho' \to \rho$. (Hint: Use Exercise 11.4.)

11.7 Complete the proof of Proposition 11.26 by establishing the case of 
general n. 

11.8 Complete the proof of Proposition 11.28 by establishing the case of 
general n. 

11.9 (a) Establish the alternative formula (11.10) for the probabilists’ Hermite
polynomials $H_j(z)$ given in Definition 11.29; equivalently,
establish the formula

\[
H_j(z) = (-1)^j \exp(\tfrac12 z^2) \cdot \frac{d^j}{dz^j} \exp(-\tfrac12 z^2).
\]

(Hint: Complete the square on the left-hand side of (11.8); then
differentiate $j$ times with respect to $t$ and evaluate at 0.)
(b) Establish the recursion

\[
H_j(z) = \Bigl(z - \frac{d}{dz}\Bigr) H_{j-1}(z) \iff h_j(z) = \frac{1}{\sqrt{j}} \cdot \Bigl(z - \frac{d}{dz}\Bigr) h_{j-1}(z)
\]

for $j \in \mathbb{N}^+$, and hence the formula $H_j(z) = \bigl(z - \frac{d}{dz}\bigr)^j 1$.
(c) Show that $H_j(z)$ is an odd function of $z$ if $j$ is odd and an even
function of $z$ if $j$ is even.

11.10 (a) Establish the derivative formula for Hermite polynomials:

\[
H_j'(z) = j \cdot H_{j-1}(z) \iff h_j'(z) = \sqrt{j} \cdot h_{j-1}(z).
\]

(b) By combining this with the other formula for $H_j'(z)$ implicit in
Exercise 11.9(b), deduce the recursion

\[
H_{j+1}(z) = z H_j(z) - j H_{j-1}(z).
\]

(c) Show that $H_j(z)$ satisfies the second-order differential equation

\[
j H_j(z) = z H_j'(z) - H_j''(z).
\]

(It’s equivalent to say that $h_j(z)$ satisfies it.) Observe that this is consistent
with Propositions 11.26 and 11.40 and says that $H_j$ (equivalently,
$h_j$) is an eigenfunction of the Ornstein–Uhlenbeck operator $L$,
with eigenvalue $j$.


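11.11 Prove that

\[
H_j(x + y) = \sum_{k=0}^{j} \binom{j}{k} x^{j-k} H_k(y).
\]

11.12 (a) By equating both sides of (11.8) with

\[
\mathbf{E}_{g \sim N(0,1)}[\exp(t(z + ig))]
\]

(where $i = \sqrt{-1}$), show that

\[
H_j(z) = \mathbf{E}_{g \sim N(0,1)}[(z + ig)^j].
\]

(b) Establish the explicit formulas

\[
\begin{aligned}
H_j(z) &= \sum_{k=0}^{\lfloor j/2 \rfloor} (-1)^k \binom{j}{2k}\,\mathbf{E}_{g \sim N(0,1)}[g^{2k}]\,z^{j-2k} \\
&= j! \cdot \left( \frac{z^j}{0!! \cdot j!} - \frac{z^{j-2}}{2!! \cdot (j-2)!} + \frac{z^{j-4}}{4!! \cdot (j-4)!} - \frac{z^{j-6}}{6!! \cdot (j-6)!} + \cdots \right).
\end{aligned}
\]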

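11.13 (a) Establish the formula

\[
\mathbf{E}[\|\nabla f\|^2] = \sum_{\alpha \in \mathbb{N}^n} |\alpha|\,\hat{f}(\alpha)^2
\]

for all $f \in L^2(\mathbb{R}^n, \gamma)$ (or at least for all $n$-variate polynomials $f$).
(b) For $f \in L^2(\mathbb{R}^n, \gamma)$, establish the formula

\[
\sum_{i=1}^{n} \mathbf{E}[\mathbf{Var}_{z_i}[f]] = \sum_{\alpha \in \mathbb{N}^n} (\#\alpha)\,\hat{f}(\alpha)^2.
\]

11.14 Show that for all $j \in \mathbb{N}$ and all $z \in \mathbb{R}$ we have

\[
\binom{n}{j}^{-1/2} \cdot K_j^{(n)}\Bigl(\frac{n}{2} - z\,\frac{\sqrt{n}}{2}\Bigr) \xrightarrow{\ n \to \infty\ } h_j(z),
\]

where $K_j^{(n)}$ is the Kravchuk polynomial of degree $j$ from Exercise 5.28
(with its dependence on $n$ indicated in the superscript).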

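11.15 Recall the definition (11.13) of the Gaussian Minkowski content of the
boundary $\partial A$ of a set $A \subseteq \mathbb{R}^n$. Sometimes the following very similar
definition is also proposed for the Gaussian surface area of $A$:

\[
M(A) = \liminf_{\epsilon \to 0^+} \frac{\mathrm{vol}_\gamma(\{z : \mathrm{dist}(z, A) < \epsilon\}) - \mathrm{vol}_\gamma(A)}{\epsilon}.
\]

Consider the following subsets of $\mathbb{R}$:

\[
A_1 = \emptyset, \quad A_2 = \{0\}, \quad A_3 = (-\infty, 0), \quad A_4 = (-\infty, 0], \quad A_5 = \mathbb{R} \setminus \{0\}, \quad A_6 = \mathbb{R}.
\]

(a) Show that

\[
\begin{array}{lll}
\gamma^+(\partial A_1) = 0 & M(A_1) = 0 & \mathrm{surf}_\gamma(A_1) = 0 \\
\gamma^+(\partial A_2) = \tfrac{1}{\sqrt{2\pi}} & M(A_2) = \sqrt{\tfrac{2}{\pi}} & \mathrm{surf}_\gamma(A_2) = 0 \\
\gamma^+(\partial A_3) = \tfrac{1}{\sqrt{2\pi}} & M(A_3) = \tfrac{1}{\sqrt{2\pi}} & \mathrm{surf}_\gamma(A_3) = \tfrac{1}{\sqrt{2\pi}} \\
\gamma^+(\partial A_4) = \tfrac{1}{\sqrt{2\pi}} & M(A_4) = \tfrac{1}{\sqrt{2\pi}} & \mathrm{surf}_\gamma(A_4) = \tfrac{1}{\sqrt{2\pi}} \\
\gamma^+(\partial A_5) = \tfrac{1}{\sqrt{2\pi}} & M(A_5) = 0 & \mathrm{surf}_\gamma(A_5) = 0 \\
\gamma^+(\partial A_6) = 0 & M(A_6) = 0 & \mathrm{surf}_\gamma(A_6) = 0.
\end{array}
\]

(b) For $A \subseteq \mathbb{R}^n$, the essential boundary (or measure-theoretic boundary)
of $A$ is defined to be

\[
\partial_* A = \Bigl\{ x \in \mathbb{R}^n : \lim_{\delta \to 0^+} \frac{\mathrm{vol}_\gamma(A \cap B_\delta(x))}{\mathrm{vol}_\gamma(B_\delta(x))} \neq 0, 1 \Bigr\},
\]

where $B_\delta(x)$ denotes the ball of radius $\delta$ centered at $x$. In other
words, $\partial_* A$ is the set of points where the “local density of $A$” is
strictly between 0 and 1. Show that if we replace $\partial A$ with $\partial_* A$
in the definition (11.13) of the Gaussian Minkowski content of the
boundary of $A$, then we have the identity $\gamma^+(\partial_* A_i) = \mathrm{surf}_\gamma(A_i)$ for
all $1 \leq i \leq 6$. Remark: In fact, the equality $\gamma^+(\partial_* A) = \mathrm{surf}_\gamma(A)$ is
known to hold for every set $A$ such that $\partial_* A$ is “rectifiable”.

11.16 Justify the formula for the Gaussian surface area of unions of intervals
stated in Example 11.50.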

11.17 (a) Let $B_r \subset \mathbb{R}^n$ denote the ball of radius $r > 0$ centered at the origin.
Show that

\[
\mathrm{surf}_\gamma(B_r) = \frac{n}{2^{n/2} (n/2)!}\, r^{n-1} e^{-r^2/2}. \tag{11.46}
\]

(b) Show that (11.46) is maximized when $r = \sqrt{n-1}$. (In case $n = 1$,
this should be interpreted as $r \to 0^+$.)
(c) Let $S(n)$ denote this maximizing value, i.e., the value of (11.46) with
$r = \sqrt{n-1}$. Show that $S(n)$ decreases from $\sqrt{2/\pi}$ to a limit of $1/\sqrt{\pi}$
as $n$ increases from 1 to $\infty$.


11.18 (a) For $f \in L^2(\mathbb{R}^n, \gamma)$, show that $Lf$ is defined, i.e.,

\[
\lim_{t \to 0} \frac{f - U_{e^{-t}} f}{t}
\]

exists in $L^2(\mathbb{R}^n, \gamma)$, if and only if $\sum_{\alpha \in \mathbb{N}^n} |\alpha|^2 \hat{f}(\alpha)^2 < \infty$. (Hint:
Proposition 11.37.)
(b) Formally justify Proposition 11.40.
(c) Let $f \in L^2(\mathbb{R}^n, \gamma)$. Show that $U_\rho f$ is in the domain of $L$ for any
$\rho \in (-1, 1)$.
Remark: It can be shown that the $\mathcal{C}^3$ hypothesis in Propositions 11.26
and 11.28 is not necessary (provided the derivatives are interpreted in the
distributional sense); see, e.g., Bogachev (Bogachev, 1998, Chapter 1)
for more details.
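11.19 This exercise is concerned with (a generalization of) the function appearing
in Borell’s Isoperimetric Theorem.

Definition 11.77. For $\rho \in [-1,1]$ we define the Gaussian quadrant
probability function $\Lambda_\rho : [0,1]^2 \to [0,1]$ by

\[
\Lambda_\rho(\alpha, \beta) = \Pr_{\substack{(z, z')\ \rho\text{-correlated} \\ \text{standard Gaussians}}}[z \leq t,\ z' \leq t'],
\]

where $t$ and $t'$ are defined by $\Phi(t) = \alpha$, $\Phi(t') = \beta$. This is a slight
reparametrization of the bivariate Gaussian cdf. We also use the shorthand
notation

\[
\Lambda_\rho(\alpha) = \Lambda_\rho(\alpha, \alpha),
\]

which we encountered in Borell’s Isoperimetric Theorem (and also in
Exercises 5.32 and 9.24, with a different, but equivalent, definition).

(a) Confirm the statement from Borell’s Isoperimetric Theorem, that for
every halfspace $H \subseteq \mathbb{R}^n$ with $\mathrm{vol}_\gamma(H) = \alpha$ we have $\mathbf{Stab}_\rho[1_H] = \Lambda_\rho(\alpha)$.
(b) Verify the following formulas:

\[
\begin{aligned}
\Lambda_\rho(\alpha, \beta) &= \Lambda_\rho(\beta, \alpha), \\
\Lambda_0(\alpha, \beta) &= \alpha\beta, \\
\Lambda_1(\alpha, \beta) &= \min(\alpha, \beta), \\
\Lambda_{-1}(\alpha, \beta) &= \max(\alpha + \beta - 1, 0), \\
\Lambda_\rho(\alpha, 0) &= \Lambda_\rho(0, \alpha) = 0, \\
\Lambda_\rho(\alpha, 1) &= \Lambda_\rho(1, \alpha) = \alpha, \\
\Lambda_{-\rho}(\alpha, \beta) &= \alpha - \Lambda_\rho(\alpha, 1 - \beta) = \beta - \Lambda_\rho(1 - \alpha, \beta), \\
\Lambda_\rho(\tfrac12, \tfrac12) &= \tfrac12 - \tfrac12 \frac{\arccos\rho}{\pi}.
\end{aligned}
\]

(c) Prove that $\Lambda_\rho(\alpha, \beta) \gtrless \alpha\beta$ according as $\rho \gtrless 0$, for all $0 < \alpha, \beta < 1$.
(d) Establish

\[
\frac{d}{d\alpha} \Lambda_\rho(\alpha, \beta) = \Phi\Bigl(\frac{t' - \rho t}{\sqrt{1 - \rho^2}}\Bigr), \qquad
\frac{d}{d\beta} \Lambda_\rho(\alpha, \beta) = \Phi\Bigl(\frac{t - \rho t'}{\sqrt{1 - \rho^2}}\Bigr),
\]

where $t = \Phi^{-1}(\alpha)$, $t' = \Phi^{-1}(\beta)$ as usual.
(e) Show that

\[
|\Lambda_\rho(\alpha, \beta) - \Lambda_\rho(\alpha', \beta')| \leq |\alpha - \alpha'| + |\beta - \beta'|,
\]

and hence $\Lambda_\rho(\alpha)$ is a 2-Lipschitz function of $\alpha$.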
11.20 Show that the general-$n$ case of Bobkov’s Inequality follows by induction
from the $n = 1$ case.

11.21 Let $f : \{-1,1\}^n \to \{-1,1\}$ and let $\alpha = \min\{\Pr[f = 1], \Pr[f = -1]\}$.
Deduce $\mathbf{I}[f] \geq 4\,\mathcal{U}(\alpha)^2$ from Bobkov’s Inequality. Show that this recovers
the edge-isoperimetric inequality for the Boolean cube (Theorem 2.39)
up to a constant factor. (Hint: For the latter problem, use
Proposition 5.27.)

11.22 Let $d_1, d_2 \in \mathbb{N}$. Suppose we take a simple random walk on $\mathbb{Z}$, starting
from the origin and moving by $\pm 1$ at each step with equal probability.
Show that the expected time it takes to first reach either $-d_1$ or $+d_2$ is
$d_1 d_2$.

11.23 Prove Claim 11.54. (Hint: For the function $V_y(\tau)$ appearing in the proof
of Bobkov’s Two-Point Inequality, you’ll want to establish that $V_y'''(0) = 0$
and that $V_y''''(0) > 0$.)

11.24 Prove Theorem 11.55. (Hint: Have the random walk start at $y_0 = a \pm \rho b$
with equal probability, and define $z_t = \|(\mathcal{U}(y_t), \rho b, \tau\sqrt{t})\|$. You’ll need
the full generality of Exercise 11.22.)

11.25 Justify Remark 11.41 (in the general-volume context) by showing
that Borell’s Isoperimetric Theorem for all functions in $K = \{f :
\mathbb{R}^n \to [0,1] \mid \mathbf{E}[f] = \alpha\}$ can be deduced from the case of functions
in $\partial K = \{f : \mathbb{R}^n \to \{0,1\} \mid \mathbf{E}[f] = \alpha\}$. (Hint: As stated in the remark,
the intuition is that $\sqrt{\mathbf{Stab}_\rho[f]}$ is a norm and that $K$ is a convex set
whose extreme points are $\partial K$. To make this precise, you may want to
use Exercise 11.1.)

11.26 The goal of this exercise and Exercises 11.27–11.29 is to give the proof
of Borell’s Isoperimetric Theorem due to Mossel and Neeman (Mossel
and Neeman, 2012). In fact, their proof gives the following natural “two-set”
generalization of the theorem (Borell’s original work (Borell, 1985)
proved something even more general):

Two-Set Borell Isoperimetric Theorem. Fix $\rho \in (0,1)$ and $\alpha, \beta \in
[0,1]$. Then for any $A, B \subseteq \mathbb{R}^n$ with $\mathrm{vol}_\gamma(A) = \alpha$, $\mathrm{vol}_\gamma(B) = \beta$,

\[
\Pr_{\substack{(z, z')\ \rho\text{-correlated} \\ n\text{-dimensional Gaussians}}}[z \in A,\ z' \in B] \leq \Lambda_\rho(\alpha, \beta). \tag{11.47}
\]

By definition of $\Lambda_\rho(\alpha, \beta)$, equality holds if $A$ and $B$ are parallel halfspaces.
Taking $\beta = \alpha$ and $B = A$ in this theorem gives Borell’s Isoperimetric
Theorem as stated in Section 11.3 (in the case of range $\{0,1\}$, at
least, which is equivalent by Exercise 11.25). It’s quite natural to guess
that parallel halfspaces should maximize the “joint Gaussian noise stability”
quantity on the left of (11.47), especially in light of Remark 10.2
from Chapter 10.1 concerning the analogous Generalized Small-Set
Expansion Theorem. Just as our proof of the Small-Set Expansion Theorem
passed through the Two-Function Hypercontractivity Theorem to
facilitate induction, so too does the Mossel–Neeman proof pass through
the following “two-function version” of Borell’s Isoperimetric Theorem:

Two-Function Borell Isoperimetric Theorem. Fix $\rho \in (0,1)$ and let
$f, g \in L^2(\mathbb{R}^n, \gamma)$ have range $[0,1]$. Then

\[
\mathbf{E}_{\substack{(z, z')\ \rho\text{-correlated} \\ n\text{-dimensional Gaussians}}}[\Lambda_\rho(f(z), g(z'))] \leq \Lambda_\rho(\mathbf{E}[f], \mathbf{E}[g]).
\]

(a) Show that the Two-Function Borell Isoperimetric Theorem implies
the Two-Set Borell Isoperimetric Theorem and the Borell Isoperimetric
Theorem (for functions with range $[0,1]$). (Hint: You may
want to use facts from Exercise 11.19.)
(b) Show conversely that the Two-Function Borell Isoperimetric Theorem
(in dimension $n$) is implied by the Two-Set Borell Isoperimetric
Theorem (in dimension $n + 1$). (Hint: Given $f : \mathbb{R}^n \to [0,1]$, define
$A \subseteq \mathbb{R}^{n+1}$ by $(z, t) \in A \iff f(z) \geq \Phi(t)$.)
(c) Let $\ell_1, \ell_2 : \mathbb{R}^n \to \mathbb{R}$ be defined by $\ell_i(z) = \langle a, z \rangle + b_i$ for some
$a \in \mathbb{R}^n$, $b_1, b_2 \in \mathbb{R}$. Show that equality occurs in the Two-Function
Borell Isoperimetric Theorem if $f(z) = 1_{\ell_1(z) \geq 0}$, $g(z) = 1_{\ell_2(z) \geq 0}$ or
if $f(z) = \Phi(\ell_1(z))$, $g(z) = \Phi(\ell_2(z))$.

11.27 Show that the inequality in the Two-Function Borell Isoperimetric Theorem
“tensorizes” in the sense that if it holds for $n = 1$, then it holds
for all $n$. Your proof should not use any property of the function $\Lambda_\rho$,
nor any property of the $\rho$-correlated $n$-dimensional Gaussian distribution
besides the fact that it’s a product distribution. (Hint: Induction
by restrictions as in the proof of the Two-Function Hypercontractivity
Induction Theorem from Chapter 9.4.)

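11.28 Let $I_1, I_2 \subseteq \mathbb{R}$ be open intervals and let $F : I_1 \times I_2 \to \mathbb{R}$ be $\mathcal{C}^2$. For
$\rho \in \mathbb{R}$, define the matrix

\[
H_\rho F = (HF) \circ \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix},
\]

where $HF$ denotes the Hessian of $F$ and $\circ$ is the entrywise (Hadamard)
product. We say that $F$ is $\rho$-concave (terminology introduced by
Ledoux (Ledoux, 2013)) if $H_\rho F$ is everywhere negative semidefinite.
Note that the $\rho = 1$ case corresponds to the usual notion of concavity,
and the $\rho = 0$ case corresponds to concavity separately along the
two coordinates. The goal of this exercise is to show that the Gaussian
quadrant probability function $\Lambda_\rho$ is $\rho$-concave for all $\rho \in (0,1)$.
(a) Extending Exercise 11.19(d), show that for any $\rho \in (-1, 1)$,

\[
\frac{d^2}{d\alpha^2} \Lambda_\rho(\alpha, \beta) = -\frac{\rho}{\sqrt{1 - \rho^2}} \cdot \frac{1}{\varphi(t)} \cdot \varphi\Bigl(\frac{t' - \rho t}{\sqrt{1 - \rho^2}}\Bigr),
\]

and deduce a similar formula for $\frac{d^2}{d\beta^2} \Lambda_\rho(\alpha, \beta)$.
(b) Show that

\[
\frac{d^2}{d\alpha\,d\beta} \Lambda_\rho(\alpha, \beta) = \frac{1}{\sqrt{1 - \rho^2}} \cdot \frac{1}{\varphi(t)} \cdot \varphi\Bigl(\frac{t - \rho t'}{\sqrt{1 - \rho^2}}\Bigr),
\]

and deduce a similar (in fact, equal) formula for $\frac{d^2}{d\beta\,d\alpha} \Lambda_\rho(\alpha, \beta)$.
(c) Show that $\det(H_\rho \Lambda_\rho) = 0$ on all of $(0,1)^2$.
(d) Show that if $\rho \in (0,1)$, then $\frac{d^2}{d\alpha^2} \Lambda_\rho, \frac{d^2}{d\beta^2} \Lambda_\rho < 0$ on $(0,1)^2$. Deduce
that $\Lambda_\rho$ is $\rho$-concave.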


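11.29 This exercise is devoted to Mossel and Neeman’s proof (Mossel and
Neeman, 2012) of the Two-Function Borell Isoperimetric Theorem in the
case $n = 1$. For another approach, see Exercise 11.30. By Exercise 11.27,
this is sufficient to establish the case of general $n$. (Actually, the proof
in this exercise works essentially verbatim in the general $n$ case, but we
stick to $n = 1$ for simplicity.)
(a) More generally, we intend to prove that for $f, g : \mathbb{R} \to [0,1]$,

\[
\lambda(\sigma) = \mathbf{E}_{\substack{(z, z')\ \rho\text{-correlated} \\ \text{standard Gaussians}}}[\Lambda_\rho(U_\sigma f(z), U_\sigma g(z'))]
\]

is a nonincreasing function of $0 < \sigma < 1$ (cf. Theorem 11.55).
Obtain the desired conclusion by taking $\sigma \to 0^+, 1^-$. (Hint: You’ll
need Exercises 11.6 and 11.19(e).)
(b) Write $f_\sigma = U_\sigma f$, $g_\sigma = U_\sigma g$ for brevity, and write $\partial_i \Lambda_\rho$ ($i = 1, 2$)
for the partial derivatives of $\Lambda_\rho$. Also let $h_1, h_2$ denote independent
standard Gaussians. Use the Chain Rule and Proposition 11.27 to
establish

\[
\begin{aligned}
\sigma \cdot \lambda'(\sigma) &= \mathbf{E}\bigl[(\partial_1 \Lambda_\rho)(f_\sigma(h_1), g_\sigma(\rho h_1 + \sqrt{1 - \rho^2}\,h_2)) \cdot L f_\sigma(h_1)\bigr] && (11.48) \\
&\quad + \mathbf{E}\bigl[(\partial_2 \Lambda_\rho)(f_\sigma(\rho h_2 + \sqrt{1 - \rho^2}\,h_1), g_\sigma(h_2)) \cdot L g_\sigma(h_2)\bigr]. && (11.49)
\end{aligned}
\]

(c) Use Proposition 11.28 to show that the first expectation (11.48)
equals

\[
\mathbf{E}\bigl[(\partial_{11} \Lambda_\rho)(f_\sigma, g_\sigma) \cdot (f_\sigma')^2 + \rho \cdot (\partial_{12} \Lambda_\rho)(f_\sigma, g_\sigma) \cdot f_\sigma' g_\sigma'\bigr],
\]

where $f_\sigma, f_\sigma'$ are evaluated at $h_1$ and $g_\sigma, g_\sigma'$ are evaluated at $\rho h_1 +
\sqrt{1 - \rho^2}\,h_2$. Give a similar formula for (11.49).
(d) Deduce that

\[
\sigma \cdot \lambda'(\sigma) = \mathbf{E}_{\substack{(z, z')\ \rho\text{-correlated} \\ \text{standard Gaussians}}}\left[
\begin{bmatrix} f_\sigma'(z) & g_\sigma'(z') \end{bmatrix}
\cdot (H_\rho \Lambda_\rho)(f_\sigma(z), g_\sigma(z')) \cdot
\begin{bmatrix} f_\sigma'(z) \\ g_\sigma'(z') \end{bmatrix}
\right],
\]

where $H_\rho$ is as in Exercise 11.28, and that indeed $\lambda$ is a nonincreasing
function.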


11.30 (a) Suppose the Two-Function Borell Isoperimetric Theorem were to
hold for 1-bit functions, i.e., for $f, g : \{-1,1\} \to [0,1]$. Then the
easy induction of Exercise 11.27 would extend the result to $n$-bit
functions $f, g : \{-1,1\}^n \to [0,1]$; in turn, this would yield the
Two-Function Borell Isoperimetric Theorem for 1-dimensional Gaussian
functions (i.e., Exercise 11.29), by the usual Central Limit Theorem
argument. Show, however, that dictator functions provide a
counterexample to a potential “1-bit Two-Function Borell Isoperimetric
Theorem”.
(b) Nevertheless, the idea can be salvaged by proving a weakened version
of the inequality for 1-bit functions that has an “error term” that
is a superlinear function of $f$ and $g$’s “influences”. Fix $\rho \in (0,1)$
and some small $\epsilon > 0$. Let $f, g : \{-1,1\} \to [\epsilon, 1 - \epsilon]$. Show that

\[
\mathbf{E}_{\substack{(x, x') \\ \rho\text{-correlated}}}[\Lambda_\rho(f(x), g(x'))] \leq \Lambda_\rho(\mathbf{E}[f], \mathbf{E}[g])
+ C_{\rho,\epsilon} \cdot \bigl(\mathbf{E}[|D_1 f|^3] + \mathbf{E}[|D_1 g|^3]\bigr),
\]

where $C_{\rho,\epsilon}$ is a constant depending only on $\rho$ and $\epsilon$. (Hint: Perform
a 2nd-order Taylor expansion of $\Lambda_\rho$ around $(\mathbf{E}[f], \mathbf{E}[g])$; in
expectation, the quadratic term should be

\[
\begin{bmatrix} D_1 f & D_1 g \end{bmatrix} \cdot (H_\rho \Lambda_\rho)(\mathbf{E}[f], \mathbf{E}[g]) \cdot \begin{bmatrix} D_1 f \\ D_1 g \end{bmatrix}.
\]

As in Exercise 11.29, show this quantity is nonpositive.)
(c) Extend the previous result by induction to obtain the following
theorem of De, Mossel, and Neeman (De et al., 2013):

Theorem 11.78. For each $\rho \in (0,1)$ and $\epsilon > 0$, there exists a
constant $C_{\rho,\epsilon}$ such that the following holds: If $f, g : \{-1,1\}^n \to
[\epsilon, 1 - \epsilon]$, then

\[
\mathbf{E}_{\substack{(x, x') \\ \rho\text{-correlated}}}[\Lambda_\rho(f(x), g(x'))] \leq \Lambda_\rho(\mathbf{E}[f], \mathbf{E}[g])
+ C_{\rho,\epsilon} \cdot (\Delta_n[f] + \Delta_n[g]).
\]

Here we are using the following inductive notation: $\Delta_1[f] = \mathbf{E}[|f -
\mathbf{E}[f]|^3]$, and

\[
\Delta_n[f] = \mathbf{E}_{x_n \sim \{-1,1\}}\bigl[\Delta_{n-1}[f_{|x_n}]\bigr] + \Delta_1[f^{\subseteq \{n\}}].
\]

(d) Prove by induction that $\Delta_n[f] \leq 8 \sum_{i=1}^{n} \|D_i f\|_3^3$.
(e) Suppose that $f, g \in L^2(\mathbb{R}, \gamma)$ have range $[\epsilon, 1 - \epsilon]$ and are
$c$-Lipschitz. Show that for any $M \in \mathbb{N}^+$, the Two-Function Borell
Isoperimetric Theorem holds for $f, g$ with an additional additive
error of $O(M^{-1/2})$, where the constant in the $O(\cdot)$ depends only on
$\rho$, $\epsilon$, and $c$. (Hint: Use $\mathrm{BitsToGaussians}_M$.)
(f) By an approximation argument, deduce the Two-Function Borell
Isoperimetric Theorem for general $f, g \in L^2(\mathbb{R}, \gamma)$ with range
$[0,1]$; i.e., prove Exercise 11.29.


11.31 Fix 0 < p < 1 and suppose f € L!(R, y) is nonnegative and satisfies 
E[ f] = 1. Note that E[U, f] = 1 as well. The goal of this problem is to 
show that U, f satisfies an improved Markov inequality: Pr[U, f > t] = 
o(—_) = o(+) as t —> oo. This gives a quantitative sense in which U, 


tynt 


is a “smoothing operator”: U, f can never look too much like like a step 


function (the tight example for Markov’s inequality). 


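(a) For simplicity, let’s first assume $\rho = 1/\sqrt{2}$. Given $t > \sqrt{2}$, select
$h > 0$ such that $\varphi(h) = \frac{1}{\sqrt{\pi}\, t}$. Show that $h \sim \sqrt{2 \ln t}$.

(b) Let $H = \{z : U_\rho f(z) > t\}$. Show that if $H \subseteq (-\infty, -h] \cup [h, \infty)$,
then we have $\Pr[U_\rho f > t] \lesssim \frac{\sqrt{2/\pi}}{t\sqrt{\ln t}}$, as desired. (Hint: You’ll need
$\overline{\Phi}(u) < \varphi(u)/u$.)

(c) Otherwise, we wish to get a contradiction. First, show that there
exist $y \in (-h, h)$ and $\delta_0 > 0$ such that $U_\rho f(z) > t$ for all
$z \in (y - \delta_0, y + \delta_0)$. (Hint: You’ll need that $U_\rho f$ is continuous; see
Exercise 11.5.)

(d) For $0 < \delta < \delta_0$, define $g \in L^1(\mathbb{R}, \gamma)$ by $g(z) = \frac{1}{2\delta} 1_{(y-\delta,\, y+\delta)}(z)$. Show
that $0 \le U_\rho g \le \frac{1}{\sqrt{\pi}}$ pointwise. (Hint: Why is $U_\rho g(z)$ maximized at
$z = \sqrt{2}\, y$?)

(e) Show that $\frac{1}{\sqrt{\pi}} \ge \langle f, U_\rho g \rangle \ge t\, \mathbf{E}[g]$.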


(f) Derive a contradiction by taking $\delta \to 0$, thereby showing that indeed
$\Pr[U_\rho f > t] \lesssim \frac{\sqrt{2/\pi}}{t\sqrt{\ln t}}$.

(g) Show that this result is tight by constructing an appropriate f.

(h) Generalize the above to show that for any fixed $0 < \rho < 1$ we have

$$\Pr[U_\rho f > t] \lesssim \frac{1}{\sqrt{\pi(1 - \rho^2)}} \cdot \frac{1}{t\sqrt{\ln t}}.$$

11.32 As described in Example 11.73, show that
$\mathrm{SDPOpt}(\mathbb{Z}_5) \ge \frac12 - \frac12 \cos\frac{4\pi}{5} = \frac58 + \frac{\sqrt{5}}{8}$.

11.33 Prove Theorem 11.72.

11.34 Consider the generalization of the Max-Cut CSP in which the variable
set is V, the domain is $\{-1, 1\}$, and each constraint is an equality of two
literals, i.e., it’s of the form $b F(v) = b' F(v')$ for some $v, v' \in V$ and
$b, b' \in \{-1, 1\}$. This CSP is traditionally called Max-E2-Lin. Given an
instance $\mathcal{P}$, write $(v, v', b, b') \sim \mathcal{P}$ to denote a uniformly chosen
constraint. The natural SDP relaxation (which can also be solved efficiently)
is the following:

$$\text{maximize} \quad \mathop{\mathbf{E}}_{(v, v', b, b') \sim \mathcal{P}}\bigl[\tfrac12 + \tfrac12 \langle b\,U(v), b'\,U(v') \rangle\bigr]$$
$$\text{subject to} \quad U : V \to S^{n-1}.$$

Show that the Goemans–Williamson algorithm, when using this SDP,
is a $(c_{\mathrm{GW}}\beta, \beta)$-approximation algorithm for Max-E2-Lin, and that it also
has the same refined guarantee as in Theorem 11.72.

11.35 This exercise builds on Exercise 11.34. Consider the following
instance $\mathcal{P}$ of Max-E2-Lin: The variable set is $\mathbb{Z}_4$ and the constraints are

$$F(0) = F(1), \quad F(1) = F(2), \quad F(2) = F(3), \quad F(3) = -F(0).$$

(a) Show that $\mathrm{Opt}(\mathcal{P}) = \frac34$.

(b) Show that $\mathrm{SDPOpt}(\mathcal{P}) \ge \frac12 + \frac{1}{2\sqrt{2}}$. (Hint: Very similar to
Exercise 11.32; you can use four unit vectors at 45° angles in $\mathbb{R}^2$.)

(c) Deduce that $\mathrm{SDPOpt}(\mathcal{P}) = \frac12 + \frac{1}{2\sqrt{2}}$ and that this is an optimal SDP
integrality gap for Max-E2-Lin. (Cf. Remark 11.76.)

11.36 In our proof of Theorem 11.[…] it’s stated that showing the $\rho$-Noise
Sensitivity Test is a $(\theta/\pi, \frac12 - \frac12 \cos\theta)$-Dictator-vs.-No-Notables test implies
the desired UG-hardness of $(\theta/\pi + \delta, \frac12 - \frac12 \cos\theta)$-approximating
Max-Cut (for any constant $\delta > 0$). There are two minor technical problems
with this: First, the test can only actually be implemented when $\rho$
is a rational number. Second, even ignoring this, Theorem 7.40 only
directly yields hardness of $(\theta/\pi + \delta, \frac12 - \frac12 \cos\theta - \delta)$-approximation.
Show how to overcome both technicalities. (Hint: Continuity.)

11.37 Use Corollary 11.59 (and (11.28)) to show that in the setting of the
Berry–Esseen Theorem, $\bigl|\, \|S\|_1 - \sqrt{2/\pi}\, \bigr| \le O(\gamma^{1/3})$. (Cf. Exercise 5.31.)

11.38 The goal of this exercise is to prove Proposition 11.58.

(a) Reduce to the case c = 1.

(b) Reduce to the case $\eta = 1$. (Hint: Dilate the input by a factor of $\eta$.)

(c) Assuming henceforth that $c = \eta = 1$, we define $\widetilde{\psi}(s) = \mathbf{E}[\psi(s + g)]$
for $g \sim \mathrm{N}(0, 1)$ as suggested; i.e., $\widetilde{\psi} = \psi * \varphi$, where $\varphi$ is the
Gaussian pdf. Show that indeed $\|\widetilde{\psi} - \psi\|_\infty \le \sqrt{2/\pi} \le 1$.

(d) To complete the proof we need to show that for all $s \in \mathbb{R}$ and
$k \in \mathbb{N}^+$ we have $|\widetilde{\psi}^{(k)}(s)| \le C_k$. Explain why, in proving this, we
may assume $\psi(s) = 0$. (Hint: This requires $k \ge 1$.)

(e) Assuming $\psi(s) = 0$, show $|\widetilde{\psi}^{(k)}(s)| = |\psi * \varphi^{(k)}(s)| \le C_k$. (Hint:
Show that $\varphi^{(k)}(s) = p(s)\varphi(s)$ for some polynomial p(s) and use
the fact that Gaussians have finite absolute moments.)

11.39 Establish the following multidimensional generalization of
Proposition 11.58:

Proposition 11.79. Let $\psi : \mathbb{R}^d \to \mathbb{R}$ be c-Lipschitz. Then for any
$\eta > 0$ there exists $\widetilde{\psi}_\eta : \mathbb{R}^d \to \mathbb{R}$ satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \le c\sqrt{d}\,\eta$ and
$\|\partial^\beta \widetilde{\psi}_\eta\|_\infty \le C_{|\beta|}\, c\sqrt{d} / \eta^{|\beta| - 1}$ for each multi-index $\beta \in \mathbb{N}^d$ with
$|\beta| = \sum_i \beta_i \ge 1$, where $C_k$ is a constant depending only on k.

11.40 In Exercise 11.38 we “mollified” a function $\psi$ by convolving it with
the (smooth) pdf of a Gaussian random variable. It’s sometimes helpful
to instead use a random variable with bounded support (but still with
a smooth pdf on all of $\mathbb{R}$). Here we construct such a random variable.
Define $b : \mathbb{R} \to \mathbb{R}$ by

$$b(x) = \begin{cases} \exp\bigl(-\frac{1}{1 - x^2}\bigr) & \text{if } -1 < x < 1, \\ 0 & \text{else.} \end{cases}$$

(a) Verify that $b(x) \ge 0$ for all x and that $b(-x) = b(x)$.

(b) Prove the following statement by induction on $k \in \mathbb{N}$: On $(-1, 1)$,
the kth derivative of b at x is of the form $p(x)(1 - x^2)^{-2k} \cdot b(x)$,
where p(x) is a polynomial.

(c) Deduce that b is a smooth ($\mathcal{C}^\infty$) function on $\mathbb{R}$.

(d) Verify that $C = \int_{-1}^{1} b(x)\,dx$ satisfies $0 < C < \infty$ and that we can
therefore define a real random variable y, symmetric and supported
on $(-1, 1)$, with the smooth pdf $\widetilde{b}(y) = b(y)/C$. Show also that for
$k \in \mathbb{N}$, the numbers $c_k = \|\widetilde{b}^{(k)}\|_\infty$ are finite and positive, where $\widetilde{b}^{(k)}$
denotes the kth derivative of $\widetilde{b}$.

(e) Give an alternate proof of Exercise 11.38 using y in place of g.
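
As a numerical sanity check on the bump function of Exercise 11.40 (not a substitute for the proofs requested there), the following minimal sketch evaluates $b$, estimates the normalizing constant $C$ by a midpoint rule, and forms the resulting pdf. The function names `bump`, `normalizing_constant`, and `bump_pdf`, and the choice of a midpoint rule with 200,000 cells, are illustrative assumptions rather than anything prescribed by the text.

```python
import math


def bump(x: float) -> float:
    """b(x) = exp(-1 / (1 - x^2)) on (-1, 1), and 0 elsewhere."""
    if -1.0 < x < 1.0:
        return math.exp(-1.0 / (1.0 - x * x))
    return 0.0


def normalizing_constant(steps: int = 200_000) -> float:
    """Midpoint-rule estimate of C, the integral of b over (-1, 1)."""
    width = 2.0 / steps
    return width * sum(bump(-1.0 + (i + 0.5) * width) for i in range(steps))


# Hypothetical module-level constant: the estimated value of C.
C = normalizing_constant()


def bump_pdf(y: float) -> float:
    """Smooth, symmetric pdf supported on (-1, 1): b(y) / C."""
    return bump(y) / C
```

The midpoint rule is adequate here because $b$ and all its derivatives vanish at the endpoints of the support; any standard quadrature routine could be swapped in.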

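11.41 Fix $u \in \mathbb{R}$, $\psi(s) = 1_{s \le u}$, and $0 < \eta < 1/2$.

(a) Suppose we approximate $\psi$ by a smooth function $\widetilde{\psi}_\eta$ as in
Exercise 11.38, i.e., we define $\widetilde{\psi}_\eta(s) = \mathbf{E}[\psi(s + \eta g)]$ for $g \sim \mathrm{N}(0, 1)$.
Show that $\widetilde{\psi}_\eta$ satisfies the following properties:

• $\widetilde{\psi}_\eta$ is a decreasing function with $\widetilde{\psi}_\eta(s) < \psi(s)$ for $s < u$ and
$\widetilde{\psi}_\eta(s) > \psi(s)$ for $s > u$.

• $|\widetilde{\psi}_\eta(s) - \psi(s)| \le \eta$ provided $|s - u| \ge O(\eta\sqrt{\log(1/\eta)})$.

• $\|\widetilde{\psi}_\eta^{(k)}\|_\infty \le C_k/\eta^k$ for each $k \in \mathbb{N}$, where $C_k$ depends only on k.

(b) Suppose we instead approximate $\psi$ by the function $\widetilde{\psi}_\eta(s) =
\mathbf{E}[\psi(s + \eta y)]$, where y is the random variable from Exercise 11.40.
Show that $\widetilde{\psi}_\eta$ satisfies the following slightly nicer properties:

• $\widetilde{\psi}_\eta$ is a nonincreasing function which agrees with $\psi$ on $(-\infty, u - \eta]$
and on $[u + \eta, \infty)$.

• $\widetilde{\psi}_\eta$ is smooth and satisfies $\|\widetilde{\psi}_\eta^{(k)}\|_\infty \le C_k/\eta^k$ for each $k \in \mathbb{N}$,
where $C_k$ depends only on k.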
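
For the Gaussian mollification in part (a) of Exercise 11.41, the smoothed step has a closed form: since $\psi(s + \eta g) = 1$ exactly when $g \le (u - s)/\eta$, one gets $\mathbf{E}[\psi(s + \eta g)] = \Phi((u - s)/\eta)$. The sketch below uses that identity to spot-check the second property numerically, taking the explicit margin $\eta\sqrt{2\ln(1/\eta)}$ as one admissible instance of the $O(\eta\sqrt{\log(1/\eta)})$ bound. The helper names `gaussian_cdf`, `step`, and `mollified_step` are assumptions made for illustration.

```python
import math


def gaussian_cdf(z: float) -> float:
    """Standard Gaussian cdf, written via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def step(s: float, u: float) -> float:
    """The indicator psi(s) = 1 if s <= u, else 0."""
    return 1.0 if s <= u else 0.0


def mollified_step(s: float, u: float, eta: float) -> float:
    """E[psi(s + eta * g)] for g ~ N(0, 1), i.e. Phi((u - s) / eta)."""
    return gaussian_cdf((u - s) / eta)


def max_error_outside_margin(u: float, eta: float) -> float:
    """Largest |mollified - step| at the two points at distance `margin` from u."""
    margin = eta * math.sqrt(2.0 * math.log(1.0 / eta))
    return max(
        abs(mollified_step(s, u, eta) - step(s, u))
        for s in (u - margin, u + margin)
    )
```

Because the mollified function is monotone, checking the two boundary points of the margin suffices: the error only shrinks farther from $u$.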


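11.42 Prove Corollary 11.61 by first proving

$$\Pr[S_Y \le u - 2\eta] - O(\eta^{-3})\gamma_{XY} \le \Pr[S_X \le u] \le \Pr[S_Y \le u + 2\eta] + O(\eta^{-3})\gamma_{XY}.$$

(Hint: Obtain $\Pr[S_X \le u - \eta] \le \mathbf{E}[\widetilde{\psi}_\eta(S_X)] \approx \mathbf{E}[\widetilde{\psi}_\eta(S_Y)] \le \Pr[S_Y \le u + \eta]$
using properties from Exercise 11.41. Then replace u with $u + \eta$
and also interchange $S_X$ and $S_Y$.)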

11.43 (a) Fix $q \in \mathbb{N}$. Establish the existence of a smooth function $f_q : \mathbb{R} \to \mathbb{R}$
that is 0 on $(-\infty, -\frac12]$ and that agrees with some polynomial of
degree exactly q on $[\frac12, \infty)$. (Hint: Induction on q; the base case
q = 0 is essentially Exercise 11.41, and the induction step can be
achieved by integration.)

(b) Deduce that for any prescribed sequence $a_0, a_1, a_2, \dots$ that is
eventually constantly 0, there is a smooth function $g : \mathbb{R} \to \mathbb{R}$ that is 0
on $(-\infty, -\frac12]$ and has $g^{(k)}(\frac12) = a_k$ for all $k \in \mathbb{N}$.

(c) Fix a univariate polynomial $p : \mathbb{R} \to \mathbb{R}$. Show that there is a smooth
function $\widetilde{\psi} : \mathbb{R} \to \mathbb{R}$ that agrees with p on $[-1, 1]$ and is
identically 0 on $(-\infty, -2] \cup [2, \infty)$.
11.44 Establish Corollary 11.70.
11.45 Prove Theorem 11.71. 


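11.46 (a) By following our proof of the d = 1 case and using the multivariate
Taylor theorem, establish the following:

Invariance Principle for Sums of Random Vectors. Let
$X_1, \dots, X_n, Y_1, \dots, Y_n$ be independent $\mathbb{R}^d$-valued random
variables with matching means and covariance matrices; i.e., $\mathbf{E}[X_t] =
\mathbf{E}[Y_t]$ and $\mathbf{Cov}[X_t] = \mathbf{Cov}[Y_t]$ for all $t \in [n]$. (Note that the d
individual components of a particular $X_t$ or $Y_t$ are not required to
be independent.) Write $S_X = \sum_{t=1}^n X_t$ and $S_Y = \sum_{t=1}^n Y_t$. Then
for any $\mathcal{C}^3$ function $\psi : \mathbb{R}^d \to \mathbb{R}$ satisfying $\|\partial^\beta \psi\|_\infty \le C$ for all
$|\beta| = 3$,

$$\bigl|\mathbf{E}[\psi(S_X)] - \mathbf{E}[\psi(S_Y)]\bigr| \le C \gamma_{XY},$$

where

$$\gamma_{XY} = \sum_{\substack{\beta \in \mathbb{N}^d \\ |\beta| = 3}} \frac{1}{\beta!} \sum_{t=1}^n \bigl(\mathbf{E}[|X_t^\beta|] + \mathbf{E}[|Y_t^\beta|]\bigr).$$

(b) Show that $\gamma_{XY}$ satisfies

$$\gamma_{XY} \le \frac{d^2}{6} \sum_{t=1}^n \sum_{i=1}^d \bigl(\mathbf{E}[|X_t^{3e_i}|] + \mathbf{E}[|Y_t^{3e_i}|]\bigr).$$

Here $X_t^{3e_i}$ denotes the cube of the ith component of vector $X_t$, and
similarly for $Y_t$. (Hint: $abc \le \frac13(a^3 + b^3 + c^3)$ for $a, b, c \ge 0$.)

(c) Deduce multivariate analogues of the Variant Berry–Esseen Theorem,
Remark 11.56, and Corollary 11.59 (using Proposition 11.79).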


11.47 Justify Remark 11.66. (Hint: You’ll need Exercise 10.29.) 


11.48 (a) Prove the following:

Multifunction Invariance Principle. Let $F^{(1)}, \dots, F^{(d)}$ be formal
n-variate multilinear polynomials each of degree at most $k \in \mathbb{N}$. Let
$x_1, \dots, x_n$ and $y_1, \dots, y_n$ be independent $\mathbb{R}^d$-valued random
variables such that $\mathbf{E}[x_t] = \mathbf{E}[y_t] = 0$ and $M_t = \mathbf{Cov}[x_t] = \mathbf{Cov}[y_t]$
for each $t \in [n]$. Assume each $M_t$ has all its diagonal entries equal
to 1 (i.e., each of the d components of $x_t$ has variance 1, and
similarly for $y_t$). Further assume each component random variable $x_t^{(j)}$
and $y_t^{(j)}$ is $(2, 3, \rho)$-hypercontractive ($t \in [n]$, $j \in [d]$). Then for any
$\mathcal{C}^3$ function $\psi : \mathbb{R}^d \to \mathbb{R}$ satisfying $\|\partial^\beta \psi\|_\infty \le C$ for all $|\beta| = 3$,

$$\bigl|\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]\bigr| \le \frac{C d^2}{3} \cdot (1/\rho)^{3k} \cdot \sum_{t=1}^n \sum_{j=1}^d \mathbf{Inf}_t[F^{(j)}]^{3/2}.$$

Here we are using the following notation: If $z = (z_1, \dots, z_n)$ is a
sequence of $\mathbb{R}^d$-valued random variables, $F(z)$ denotes the vector
in $\mathbb{R}^d$ whose jth component is $F^{(j)}(z_1^{(j)}, \dots, z_n^{(j)})$.

(Hint: Combine the proofs of the Basic Invariance Principle and the
Invariance Principle for Sums of Random Vectors, Exercise 11.46.
The only challenging part should be notation.)

(b) Show that if we further have $\mathbf{Var}[F^{(j)}] \le 1$ and $\mathbf{Inf}_t[F^{(j)}] \le \epsilon$ for
all $j \in [d]$, $t \in [n]$, then

$$\bigl|\mathbf{E}[\psi(F(x))] - \mathbf{E}[\psi(F(y))]\bigr| \le \frac{C d^3}{3} \cdot k (1/\rho)^{3k} \cdot \epsilon^{1/2}.$$


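11.49 (a) Prove the following:

Invariance Principle in general product spaces. Let $(\Omega, \pi)$ be a
finite probability space, $|\Omega| = m \ge 2$, in which every outcome has
probability at least $\lambda$. Suppose $f \in L^2(\Omega^n, \pi^{\otimes n})$ has degree at most
k; thus, fixing some Fourier basis $\phi_0, \dots, \phi_{m-1}$ for $L^2(\Omega, \pi)$, we
have

$$f = \sum_{\substack{\alpha \in \mathbb{N}^n_{<m} \\ \#\alpha \le k}} \widehat{f}(\alpha)\, \phi_\alpha.$$

Introduce indeterminates $x = (x_{i,j})_{i \in [n],\, j \in [m-1]}$ and let F be the
formal $(m-1)n$-variate polynomial of degree at most k defined by

$$F = \sum_{\#\alpha \le k} \widehat{f}(\alpha) \prod_{i \in \mathrm{supp}(\alpha)} x_{i,\alpha_i}.$$

Then for any $\psi : \mathbb{R} \to \mathbb{R}$ that is $\mathcal{C}^3$ and satisfies $\|\psi'''\|_\infty \le C$ we
have

$$\Bigl|\mathop{\mathbf{E}}_{x \sim \{-1,1\}^{(m-1)n}}[\psi(F(x))] - \mathop{\mathbf{E}}_{\omega \sim \pi^{\otimes n}}[\psi(f(\omega))]\Bigr| \le \frac{C}{3} \cdot (2\sqrt{2/\lambda})^{k} \cdot \sum_{i=1}^n \mathbf{Inf}_i[f]^{3/2}.$$

(Hint: For $0 \le t \le n$, define the function
$h_t \in L^2\bigl(\Omega^t \times \{-1, 1\}^{(m-1)(n-t)},\ \pi^{\otimes t} \otimes \pi_{1/2}^{\otimes (m-1)(n-t)}\bigr)$ via

$$h_t(\omega_1, \dots, \omega_t, x_{t+1,1}, \dots, x_{n,m-1}) = \sum_{\#\alpha \le k} \widehat{f}(\alpha) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i \le t}} \phi_{\alpha_i}(\omega_i) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i > t}} x_{i,\alpha_i}.$$

Express

$$h_t = \mathrm{E}_t h_t + \mathrm{L}_t h_t = \mathrm{E}_t h_t + \sum_{j=1}^{m-1} D_j \cdot \phi_j(\omega_t),$$

where

$$D_j = \sum_{\alpha : \alpha_t = j} \widehat{f}(\alpha) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i < t}} \phi_{\alpha_i}(\omega_i) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i > t}} x_{i,\alpha_i},$$

and note that $h_{t-1} = \mathrm{E}_t h_t + \sum_{j=1}^{m-1} D_j \cdot x_{t,j}$.)

(b) In the setting of the previous theorem, show also that

$$\Bigl|\mathop{\mathbf{E}}_{g \sim \mathrm{N}(0,1)^{(m-1)n}}[\psi(F(g))] - \mathop{\mathbf{E}}_{\omega \sim \pi^{\otimes n}}[\psi(f(\omega))]\Bigr| \le \frac{2C}{3} \cdot (2\sqrt{2/\lambda})^{k} \cdot \sum_{i=1}^n \mathbf{Inf}_i[f]^{3/2}.$$

(Hint: Apply the Basic Invariance Principle in the form of
Exercise 11.47. How can you bound the $(m-1)n$ influences of F in
terms of the n influences of f?)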
11.50 Prove the following version of the General-Volume Majority Is Stablest
Theorem in the setting of general product spaces:

Theorem 11.80. Let $(\Omega, \pi)$ be a finite probability space in which
each outcome has probability at least $\lambda$. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have
range [0, 1]. Suppose that f has no $(\epsilon, \frac{1}{\log(1/\epsilon)})$-notable coordinates.
Then for any $0 \le \rho < 1$,


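$$\mathbf{Stab}_\rho[f] \le \Lambda_\rho(\mathbf{E}[f]) + O\Bigl(\frac{\log\log(1/\epsilon)}{\log(1/\epsilon)}\Bigr) \cdot \frac{\log(1/\lambda)}{1 - \rho}.$$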


(Hint: Naturally, you’ll need Exercise 11.49(b).) 


Notes 


The subject of Gaussian space is too enormous to be surveyed here; some recommended 
texts include Janson (Janson, 1997) and Bogachev (Bogachev, 1998), the latter having 
an extremely thorough bibliography. The Ornstein–Uhlenbeck semigroup dates back to
the work of Uhlenbeck and Ornstein (Uhlenbeck and Ornstein, 1930) whose motiva- 
tion was to refine Einstein’s theory of Brownian motion (Einstein, 1905) to take into 
account the inertia of the particle. The relationship between the action of $U_\rho$ on
functions and on Hermite expansions (i.e., Proposition 11.31) dates back even further, to
Mehler (Mehler, 1866). Hermite polynomials were first defined by Laplace (Laplace, 
1811), and then studied by Chebyshev (Chebyshev, 1860) and Hermite (Hermite, 1864). 
See Lebedev (Lebedev, 1972, Chapter 4.15) for a proof of the pointwise convergence 
of a piecewise-$\mathcal{C}^1$ function’s Hermite expansion.

As mentioned in Chapter 9.7, the Gaussian Hypercontractivity Theorem is originally 
due to Nelson (Nelson, 1966) and now has many known proofs. The idea behind 
the proof we presented — first proving the Boolean hypercontractivity result and then 
deducing the Gaussian case by the Central Limit Theorem — is due to Gross (Gross, 
1975) (see also Trotter (Trotter, 1958)). Gross actually used the idea to prove his 
Gaussian Log-Sobolev Inequality, and thereby deduced the Gaussian Hypercontractivity 
Theorem. Direct proofs of the Gaussian Hypercontractivity Theorem have been given 
by Neveu (Neveu, 1976) (using stochastic calculus), Brascamp and Lieb (Brascamp and 
Lieb, 1976) (using rearrangement (Brascamp and Lieb, 1976)), and Ledoux (Ledoux, 
2013) (using a variation on Exercises 11.26-11.29); direct proofs of the Gaussian Log- 
Sobolev Inequality have been given by Adams and Clarke (Adams and Clarke, 1979), 
by Bakry and Emery (Bakry and Emery, 1985), and by Ledoux (Ledoux, 1992), the 
latter two using semigroup techniques. Bakry’s survey (Bakry, 1994) on these topics is 
also recommended. 

The Gaussian Isoperimetric Inequality was first proved independently by 
Borell (Borell, 1975) and by Sudakov and Tsirel’son (Sudakov and Tsirel’son, 1978). 
Both works derived the result by taking the isoperimetric inequality on the sphere (due 
to Lévy (Lévy, 1922) and Schmidt (Schmidt, 1948), see also Figiel, Lindenstrauss, 
and Milman (Figiel et al., 1977)) and then taking “Poincaré’s limit” — i.e., viewing 
Gaussian space as a projection of the sphere of radius $\sqrt{n}$ in n dimensions, with
$n \to \infty$ (see Lévy (Lévy, 1922), McKean (McKean, 1973), and Diaconis and Freed-
man (Diaconis and Freedman, 1987)). Ehrhard (Ehrhard, 1983) gave a different proof 
using a symmetrization argument intrinsic to Gaussian space. This may be compared to 
the alternate proof of the spherical isoperimetric inequality (Benyamini, 1984) based on 
the “two-point symmetrization” of Baernstein and Taylor (Baernstein and Taylor, 1976) 
(analogous to Riesz rearrangement in Euclidean space and to the polarization operation 
from Exercise 2.52). 

To carefully define Gaussian surface area for a broad class of sets requires ventur- 
ing into the study of geometric measure theory and functions of bounded variation. 
For a clear and comprehensive development in the Euclidean setting (including the 
remark in Exercise 11.15(b)), see the book by Ambrosio, Fusco, and Pallara (Ambrosio 
et al., 2000). There’s not much difference between the Euclidean and finite-dimensional 
Gaussian settings; research on Gaussian perimeter tends to focus on the trickier infinite- 
dimensional case. For a thorough development of surface area in this latter setting 
(which of course includes finite-dimensional Gaussian space as a special case) see the 
work of Ambrosio, Miranda, Maniglia, and Pallara (Ambrosio et al., 2010); in particular,
Theorem 4.1 in that work gives several additional equivalent definitions for $\mathrm{surf}_\gamma$,
besides those in Definition 11.48. Regarding the fact that $\mathbf{RS}_A'(0^+)$ is an equivalent
definition, the Euclidean analogue of this statement was proven in Miranda et al. (Miranda
et al., 2007) and the statement itself follows similarly (Miranda, 2013) using Ambrosio 
et al. (Ambrosio et al., 2013). (Our heuristic justification of (11.14) is similar to the one
given by Kane (Kane, 2011).) Additional related results can be found in Hino (Hino, 
2010) (which includes the remark about convex sets at the end of Definition 11.48), 
Ambrosio and Figalli (Ambrosio and Figalli, 2011), Miranda et al. (Miranda et al., 
2012), and Ambrosio et al. (Ambrosio et al., 2013). 

The inequality of Theorem 11.51 is explicit in Ledoux (Ledoux, 1994) (see also 
the excellent survey (Ledoux, 1996)); he used it to deduce the Gaussian Isoperimetric 
Inequality. He also noted that it’s essentially deducible from an earlier inequality of 
Pisier and Maurey (Pisier, 1986, Theorem 2.2). Theorem 11.43, which expresses the 
subadditivity of rotation sensitivity, can be viewed as a discretization of the Pisier— 
Maurey inequality. This theorem appeared in work of Kindler and O’Donnell (Kindler
and O’Donnell, 2012), which also made the observations about the volume-$\frac12$ case of
Borell’s Isoperimetric Theorem at the end of Section 11.3 and in Remark 11.76. 

Bobkov’s Inequality (Bobkov, 1997) in the special case of Gaussian space had 
already been implicitly established by Ehrhard (Ehrhard, 1984); the striking novelty 
of Bobkov’s work (partially inspired by Talagrand (Talagrand, 1993)) was his reduc- 
tion to the two-point Boolean inequality. The proof of this inequality which we
presented is, as mentioned, a discretization of the stochastic calculus proof of Barthe and
Maurey (Barthe and Maurey, 2000). (In turn, they were extending the stochastic cal- 
culus proof of Bobkov’s Inequality in the Gaussian setting due to Capitaine, Hsu, and 
Ledoux (Capitaine et al., 1997).) The idea that it’s enough to show that Claim 11.54 
is “nearly true” by computing two derivatives — as opposed to showing it’s exactly 
true by computing four derivatives — was communicated to the author by Yuval Peres. 
Following Bobkov’s paper, Bakry and Ledoux (Bakry and Ledoux, 1996) established 
Theorem 11.55 in very general infinite-dimensional settings including Gaussian space; 
Ledoux (Ledoux, 1998) further pointed out that the Gaussian version of Bobkov’s 
Inequality has a very short and direct semigroup-based proof. See also Bobkov and 
Götze (Bobkov and Götze, 1999) and Tillich and Zémor (Tillich and Zémor, 2000) for 
results similar to Bobkov’s Inequality in other discrete settings. 

Borell’s Isoperimetric Theorem is from Borell (Borell, 1985). Borell’s proof used 
“Ehrhard symmetrization” and actually gave much stronger results — e.g., that if 
$f, g \in L^2(\mathbb{R}^n, \gamma)$ are nonnegative and $q \ge 1$, then $\langle (U_\rho f)^q, g \rangle$ can only increase under
simultaneous Ehrhard symmetrization of f and g. There are at least four other known 
proofs of the basic Borell Isoperimetric Theorem. Beckner (Beckner, 1992) observed 
that the analogous isoperimetric theorem on the sphere follows from two-point sym- 
metrization; this yields the Gaussian result via Poincaré’s limit (for details, see Carlen 
and Loss (Carlen and Loss, 1990)). (This proof is perhaps the conceptually simplest one, 
though carrying out all the technical details is a chore.) Mossel and Neeman (Mossel 
and Neeman, 2012) gave the proof based on semigroup methods outlined in Exer- 
cises 11.26-11.29, and later together with De (De et al., 2012) gave a “Bobkov-style” 
Boolean proof (see Exercise 11.30). Finally, Eldan (Eldan, 2013) gave a proof using 
stochastic calculus. 

As mentioned in Section 11.5 there are several known ways to prove the Berry— 
Esseen Theorem. Aside from the original method (characteristic functions), there is 
also Stein’s Method (Stein, 1972, 1986); see also, e.g., (Bolthausen, 1984; Barbour 
and Hall, 1984; Chen et al., 2011). The Replacement Method approach we presented 
originates in the work of Lindeberg (Lindeberg, 1922). The mollification techniques 
used (e.g., those in Exercise 11.40) are standard. The Invariance Principle as presented 
in Section 11.6 is from Mossel, O’Donnell, and Oleszkiewicz (Mossel et al., 2010).
Further extensions (e.g., Exercise 11.48) appear in the work of Mossel (Mossel, 2010).
In fact the Invariance Principle dates back to the 1971 work of Rotar’ (Rotar’, 1973, 
1974); therein he essentially proved the Invariance Principle for degree-2 multilinear 
polynomials (even employing the term “influence” as we do for the quantity in Defi- 
nition 11.63). Earlier work on extending the Central Limit Theorem to higher-degree 
polynomials had focused on obtaining sufficient conditions for polynomials (especially 
quadratics) to have a Gaussian limit distribution; this is the subject of U-statistics. Rotar’ 
emphasized the idea of invariance and of allowing any (quadratic) polynomial with low 
influences. Rotar’ also credited Girko (Girko, 1973) with related results in the case of 
positive definite quadratic forms. In 1975, Rotar’ (Rotar’, 1975) generalized his results 
to handle multilinear polynomials of any constant degree, and also random vectors (as 
in Exercise 11.48). (Rotar’ also gave further refinements in 1979 (Rotar’, 1979).) 

The difference between the results of Rotar’ (Rotar’, 1975) and Mossel et al. (Mossel 
et al., 2010) comes in the treatment of the error bounds. It’s somewhat difficult to extract 
simple-to-state error bounds from Rotar’ (Rotar’, 1975), as the error there is presented as 
a sum over $i \in [n]$ of expressions $\mathbf{E}[F(x)^2 \mathbf{1}_{|F(x)| > u_i}]$, where $u_i$ involves $\mathbf{Inf}_i[F]$. (Partly
this is so as to generalize the statement of the Lindeberg CLT.) Nevertheless, the work 
of Rotar’ implies a Lévy distance bound as in Corollary 11.70, with some inexplicit 
function $o_\epsilon(1)$ in place of $(1/\rho)^{O(k)} \epsilon^{1/8}$. By contrast, the work of Mossel et al. (Mossel
et al., 2010) shows that a straightforward combination of the Replacement Method and 
hypercontractivity yields good, explicit error bounds. Regarding the Carbery—Wright 
Theorem (Carbery and Wright, 2001), an alternative exposition appears in Nazarov, 
Sodin, and Vol’berg (Nazarov et al., 2002).

Regarding the Majority Is Stablest Theorem (conjectured in Khot, Kindler, Mossel, 
and O’Donnell (Khot et al., 2004) and proved originally in Mossel, O’Donnell, and 
Oleszkiewicz (Mossel et al., 2005b)), it can be added that additional motivation for 
the conjecture came from Kalai (Kalai, 2002). The fact that (SDP) is an efficiently 
computable relaxation for the Max-Cut problem dates back to the 1990 work of Delorme 
and Poljak (Delorme and Poljak, 1993); however, they were unable to give an analysis 
relating its value to the optimum cut value. In fact, they conjectured that the case of the 
5-cycle from Example 11.73 had the worst ratio of Opt(G) to SDPOpt(G). Goemans 
and Williamson (Goemans and Williamson, 1994) were the first to give a sharp analysis 
of the SDP (Theorem 11.72), at least for $\theta \ge \theta^*$. Feige and Schechtman (Feige and
Schechtman, 2002) showed an optimal integrality gap for the SDP for all values $\theta \ge \theta^*$
(in particular, showing an integrality gap ratio of $c_{\mathrm{GW}}$); interestingly, their construction
essentially involved proving Borell’s Isoperimetric Inequality (though they did it on 
the sphere rather than in Gaussian space). Both before and after the Khot et al. (Khot 
et al., 2004) UG-hardness result for Max-Cut there was a long line of work (Karloff, 
1999; Zwick, 1999; Alon and Sudakov, 2000; Alon et al., 2002; Charikar and Wirth, 
2004; Khot and Vishnoi, 2005; Feige and Langberg, 2006; Khot and O’Donnell, 2006)
devoted to improving the known approximation algorithms and UG-hardness results, in
particular for $\theta < \theta^*$. This culminated in the results from O’Donnell and Wu (O’Donnell
and Wu, 2008) (mentioned in Remark 11.75), which showed explicit matching $(\alpha, \beta)$-
approximation algorithms, integrality gaps, and UG-hardness results for all $\frac12 < \beta < 1$.
The fact that the best integrality gaps matched the best UG-hardness results proved not to 
be a coincidence; in contemporaneous work, Raghavendra (Raghavendra, 2008) showed 
that for any CSP, any SDP integrality gap could be turned into a matching Dictator-vs.- 
No-Notables test. This implies the existence of matching efficient $(\alpha, \beta)$-approximation
algorithms and UG-hardness results for every CSP and every $\beta$. See Raghavendra’s
thesis (Raghavendra, 2009) for full details of his earlier publication (Raghavendra, 2008)
(including some Invariance Principle extensions building further on Mossel (Mossel, 
2010)); see also Austrin’s work (Austrin, 2007a,b) for precursors to the Raghavendra 
theory. 
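
The constants appearing in this discussion can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $c_{\mathrm{GW}}$ to be the minimum over $\theta \in (0, \pi]$ of $(\theta/\pi) / (\frac12 - \frac12\cos\theta)$, locates the minimizer $\theta^*$ by a plain grid search, and evaluates the planar vector solutions suggested in Exercises 11.32 and 11.35 (vectors at angles $4\pi/5$ apart for the 5-cycle, and at 45° apart for the $\mathbb{Z}_4$ instance of Max-E2-Lin). All function and variable names are hypothetical, and the vector solutions only certify lower bounds on the SDP optimum.

```python
import itertools
import math


def gw_ratio(theta: float) -> float:
    """(theta/pi) divided by (1/2 - 1/2 cos theta)."""
    return (theta / math.pi) / (0.5 - 0.5 * math.cos(theta))


# Plain grid search over (0, pi] for the minimizing angle and the constant.
GRID = 200_000
theta_star = min((math.pi * i / GRID for i in range(1, GRID + 1)), key=gw_ratio)
c_gw = gw_ratio(theta_star)


def unit(angle: float):
    """Unit vector in the plane at the given angle."""
    return (math.cos(angle), math.sin(angle))


def dot(a, b) -> float:
    return a[0] * b[0] + a[1] * b[1]


def best_fraction(n: int, constraints) -> float:
    """Brute-force Opt; a constraint (v, w, s) means F(v) = s * F(w)."""
    return max(
        sum(1 for v, w, s in constraints if F[v] == s * F[w]) / len(constraints)
        for F in itertools.product((-1, 1), repeat=n)
    )


def sdp_value(vectors, constraints) -> float:
    """Average of 1/2 + (s/2) <U(v), U(w)> over the constraints."""
    return sum(
        0.5 + 0.5 * s * dot(vectors[v], vectors[w]) for v, w, s in constraints
    ) / len(constraints)


# 5-cycle Max-Cut: an edge is cut when F(v) = -F(w), so s = -1 on every edge.
c5 = [(v, (v + 1) % 5, -1) for v in range(5)]
sdp_c5 = sdp_value([unit(4 * math.pi * v / 5) for v in range(5)], c5)

# Z_4 instance of Max-E2-Lin from Exercise 11.35, vectors 45 degrees apart.
z4 = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, -1)]
sdp_z4 = sdp_value([unit(math.pi * v / 4) for v in range(4)], z4)
```

With these definitions, `best_fraction(4, z4) / sdp_z4` coincides with `gw_ratio(3 * math.pi / 4)`, which is one way to see why the $\mathbb{Z}_4$ instance sits exactly on the curve $(\theta/\pi, \frac12 - \frac12\cos\theta)$ at $\theta = 3\pi/4$.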

Exercise 11.31 concerns a problem introduced by Talagrand (Talagrand, 1989). 
Talagrand offers a $1000 prize (Talagrand, 2006) for a solution to the following Boolean 
version of the problem: Show that for any fixed $0 < \rho < 1$ and for $f : \{-1, 1\}^n \to \mathbb{R}^{\ge 0}$
with $\mathbf{E}[f] = 1$ it holds that $\Pr[T_\rho f > t] = o(1/t)$ as $t \to \infty$. (The rate of decay may
depend on $\rho$ but not, of course, on n; in fact, a bound of the form $O\bigl(\frac{1}{t\sqrt{\log t}}\bigr)$ is expected.)
The result outlined in Exercise 11.31 (obtained together with James Lee) is for the very
special case of 1-dimensional Gaussian space; Ball, Barthe, Bednorz, Oleszkiewicz, and
Wolff (Ball et al., 2013) obtained the same result and also showed a bound of $O\bigl(\frac{\log\log t}{t\sqrt{\log t}}\bigr)$
for d-dimensional Gaussian space (with the constant in the $O(\cdot)$ depending on d).

The Multifunction Invariance Principle (Exercise 11.48 and its special case Exer- 
cise 11.46) are from Mossel (Mossel, 2010); the version for general product spaces 
(Exercise 11.49) is from Mossel, O’Donnell, and Oleszkiewicz (Mossel et al., 2010).


Some Tips 


• You might try using analysis of Boolean functions whenever you’re faced
  with a problem involving Boolean strings in which both the uniform
  probability distribution and the Hamming graph structure play a role. More
  generally, the tools may still apply when studying functions on (or subsets
  of) product probability spaces.

• If you’re mainly interested in unbiased functions, or subsets of volume $\frac12$,
  use the representation $f : \{-1, 1\}^n \to \{-1, 1\}$. If you’re mainly interested
  in subsets of small volume, use the representation $f : \{-1, 1\}^n \to \{0, 1\}$.

• As for the domain, if you’re interested in the operation of adding two strings
  (modulo 2), use $\mathbb{F}_2^n$. Otherwise use $\{-1, 1\}^n$.

• If you have a conjecture about Boolean functions:
  - Test it on dictators, majority, parity, tribes (and maybe recursive majority
    of 3). If it’s true for these functions, it’s probably true.
  - Try to prove it by induction on n.
  - Try to prove it in the special case of functions on Gaussian space.

• Try not to prove any bound on Boolean functions $f : \{-1, 1\}^n \to \{-1, 1\}$
  that involves the parameter n.

• Analytically, the only multivariate polynomials we really know how to
  control are degree-1 polynomials. Try to reduce to this case if you can.

• Hypercontractivity is useful in two ways: (i) It lets you show that
  low-degree functions of independent random variables behave “reasonably”.
  (ii) It implies that the noisy hypercube graph is a small-set expander.

• Almost any result about functions on the hypercube extends to the case of
  the p-biased cube, and more generally, to the case of functions on products
  of discrete probability spaces in which every outcome has probability at least
  p (possibly with a dependence on p, though).

• Every Boolean function consists of a junta part and a Gaussian part.
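
The tip about testing a conjecture on dictators, majority, and parity is easy to act on for small n. The following is a minimal brute-force sketch, assuming the standard formulas $\widehat{f}(S) = \mathbf{E}[f(x)\prod_{i \in S} x_i]$ and $\mathbf{Inf}_i[f] = \sum_{S \ni i} \widehat{f}(S)^2$; the helper names are illustrative, and the enumeration over all $2^n$ inputs and all $2^n$ subsets is only practical for small n.

```python
import itertools
import math


def fourier_coefficients(f, n: int) -> dict:
    """Map each subset S (as a sorted tuple) to E[f(x) * prod_{i in S} x_i]."""
    points = list(itertools.product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for S in itertools.combinations(range(n), size):
            total = sum(f(x) * math.prod(x[i] for i in S) for x in points)
            coeffs[S] = total / len(points)
    return coeffs


def influences(coeffs: dict, n: int) -> list:
    """Inf_i[f] computed as the sum of squared coefficients over sets containing i."""
    return [sum(c * c for S, c in coeffs.items() if i in S) for i in range(n)]


def dictator(x):
    return x[0]


def majority(x):
    return 1 if sum(x) > 0 else -1


def parity(x):
    return math.prod(x)


# Hypothetical test bench: the three standard examples on n = 3 bits.
N = 3
TEST_FUNCTIONS = {"dictator": dictator, "majority": majority, "parity": parity}
SPECTRA = {name: fourier_coefficients(f, N) for name, f in TEST_FUNCTIONS.items()}
```

A candidate inequality can then be evaluated on each entry of `SPECTRA`; for instance, `influences(SPECTRA["majority"], N)` gives the three coordinate influences of 3-bit majority.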
for sums of random vectors
nonuniform, 126 
Variant, 356, 359 
biased Fourier analysis, 211 
bit, 1, 2 
BLR (Blum-Luby-Rubinfeld) Test, 15, 163, 
165, 188 
derandomized, 148, 161 
BLR+NAE Test, 166 
Bobkov’s Inequality, 347-350, 377, 390 
constraint satisfaction problem, see CSP
correlated Gaussians, 328
correlation distillation, 51, 123
correlation immune, 136, 160
covariance, 10
cryptography, 68, 77, 94 
CSP, 173-183 
equivalence with testing, 177 
cube, Hamming, 2 


decision list, 73 
decision tree, 58, 222 
depth, 59 
Fourier spectrum, 59
product space domains, 223
randomized, 222, 235
read-once, 73
size, 59
decision tree process, 226
product space domains, 206
general product space, 211
density function, see probability density 
derandomization, 145-149 
derivative operator, 30 
biased Fourier analysis, 213-214 
Dickson’s Theorem, 155 
dictator, 27 
biased Fourier analysis, 213 
dictator testing, see testing, dictatorship 
Dictator-vs.-No-Notables test, 182, 369 
connection with hardness, 182, 366 
for Max-E3-Lin, 183-186 
directional derivative, 151 
discrete cube, see cube, Hamming 
discrete derivative, see derivative operator 
discrete gradient, see gradient operator 
distance, relative Hamming, 9 
Fourier spectrum, 82, 87-89, 95
read-once, 95 
size, 80 
width, 80, 265 
domain (CSP), 174 
dual group, 221, 234 
dual norm, 253 
dual, Boolean, 19 
Efron-Stein decomposition, see orthogonal
decomposition

Efron-Stein Inequality, see Poincaré
entropy functional, 318

(ε, δ)-small stable influences, 133, 181

(ε, k)-regular, 134

(ε, k)-wise independent, 134, 143-144

ε-biased set, ε-biased density, see probability
density

ε-close, 15

ε-fools, see fooling

ε-regular, 132

ε-uniform, see ε-regular

equality function, 17, 154 

Erdős–Rényi random graph, see random graph

even function, 19 

exclusive-or, see parity 

expansion, 36 

small-set, 36, 113, 249, 258, 262, 280, 
expectation operator, 32, 203
F₂-degree, 138, 150
F₂-polynomial representation, 136-138,
159
learning, 157
F₂ (finite field), 141
Fast Walsh—Hadamard Transform, 20 
FKN Theorem, 45, 117, 245 
folding, 190 
fooling, 149 
Fourier analysis of Boolean functions, see 
analysis of Boolean functions 
Fourier basis, 199, 335, 337 
Fourier coefficient, 4 
formula, 8 
product space domains, 201 
Fourier Entropy—Influence Conjecture, 
266 
Fourier expansion, 2-5 
product space domains, 201 
Fourier norm, 57 
1- , 20, 69, 72, 73, 78, 145-148 
4- , 22, 133, 149, 159 
Fourier sparsity, 57, 75, 273 
Fourier spectrum, 4 
Fourier weight, 10 
degree-1, 44, 111-112, 128 
general product space, 211 
Fp (finite field), 221 
Friedgut’s Conjecture, 302 
Friedgut’s Junta Theorem, 263-265, 305 
product space domains, 291, 301 
Friedgut’s Sharp Threshold Theorem, 302


Gaussian isoperimetric function, 112, 128, 
343 

Gaussian Isoperimetric Inequality, 343-347, 
389-390 

Gaussian Minkowski content, see Gaussian 
surface area 

Gaussian noise operator, 329, 389 

Gaussian quadrant probability, 107, 127, 270, 
376, 379 

Gaussian random variable, 104, 105 

simulated by bits, 327 

Gaussian space, 326, 388 

Gaussian surface area, 343—347, 375, 389 

Gaussian volume, 326 

General Hypercontractivity Theorem, see 
Hypercontractivity Theorem, General 

Goemans—Williamson Algorithm, 179, 
366-368, 383 


Goldreich—Levin Algorithm, 68-71, 146-148 
Gotsman-Linial Conjecture, 121, 346 
Gotsman-—Linial Theorem, 100, 102 
Gowers norm, 158 
gradient operator, 35 
granularity, Fourier spectrum, 20, 57, 58, 59, 
75, 155 
graph property, 215, 291 
monotone, 215, 302 
Guilbaud’s Formula, 44 


Hadamard Matrix, 20 
halfspace, see linear threshold function 
Hamming ball, 46 
degree-1 weight, 112 
Hamming cube, see cube, Hamming 
Hamming distance, 2 
harmonic analysis of Boolean functions, see 
analysis of Boolean functions 
Hatami’s Theorem, 304 
Hausdorff—Young Inequality, 72 
hemi-icosahedron function, 18 
Hermite expansion, 338 
Hermite polynomials, 335-338, 373-375, 389 
multivariate, 337 
Hoeffding decomposition, see orthogonal 
decomposition 
Hölder inequality, 247
hypercontractivity, 24, 102, 250-251, 270, 
275-277, 278, 283-288, 323 
(2, q)- and (p, 2)-, 240, 251-256 
biased bits, 287 
general product probability spaces, 
315-318 
induction, 254-256, 281, 311 
preserved by sums, 250, 310 
Hypercontractivity Theorem, 240, 269, 
278-283 
Gaussian, 331-332, 389 
General, 278, 288 
Two-Function, 254-256, 276, 279-281, 
378 
Reverse, 312, 323 
hypercube, see cube, Hamming 


impartial culture assumption, 28 
indicator basis, 198 

indicator function, 12, 17 
indicator polynomial, 3 
induction, 254 
influence, 29-31 
ρ-stable, see stable influence 
average, 49, 119 
biased Fourier analysis, 214 
coalitional, 271 
maximum, 260 
product space domains, 203-205 
inner product, 6 
inner product mod 2 function, 17, 103, 132, 
137, 140, 150, 188 
instance (CSP), 175 
Invariance Principle, 390 
basic, 360, 370 
for sums of random variables, 354 
for sums of random vectors, 386 
general product spaces, 387 
multifunction, 386 
Invariance Principles, 359-366, 386-388 
isomorphic, 23 
isoperimetric inequality 
Hamming cube, 36, 127, 262, 319, 348 
Itô’s Formula, 348 


junta, 27, 265 
learning, 75, 144-145, 158, 161 


k-wise independent, 136, 142-143, 160 
Kahn–Kalai–Linial Theorem, see KKL 
Theorem 
Khintchine(–Kahane) Inequality, 51, 101, 
257 
KKL Theorem, 83, 260-263, 277 
edge-isoperimetric version, 262 
product space domains, 290 
Kravchuk polynomials, 126, 375 
Krawtchouk polynomials, see Kravchuk 
polynomials 
Kushilevitz function, see hemi-icosahedron 
function 
Kushilevitz–Mansour Algorithm, see 
Goldreich–Levin Algorithm 


L², 197 
Lévy distance, 358, 365 
Laplacian operator, 35 
ith coordinate, 32, 204 
learning theory, 64-68, 119, 145-148 
Level-k Inequalities, 250, 259 
level-1 Fourier weight, see Fourier weight, 
degree-1 
Level-1 Inequality, 113, 259, 269 
Lindeberg Method, see Replacement Method 
linear (over F2), 14 
linear threshold function, 27, 99-100, 265 
Fourier weight, 100-101 
learning, 119 
noise stability, 107, 118-121, 127 
literal, 79 
LMN Theorem, 93 
locally correctable, 16 
locally testable proof, see PCPP 
Log-Sobolev Inequality, 276, 318-319 
Gaussian, 334, 389 
product space domains, 320 
Low-Degree Algorithm, 67, 76 
low-degree projection, see projection, 
low-degree 
LTF, see linear threshold function 


Möbius inversion, 154 
majority, 3, 18, 26 
Fourier coefficients, 109 
Fourier weight, 108-111 
noise stability, 38, 106-108, 125, 127 
total influence, 34, 104-105 
Majority Is Least Stable Conjecture, 121 
Majority Is Stablest Theorem, 108, 114, 325, 
359, 366, 370-372 
general product spaces, 388 
Mansour’s Conjecture, 82 
Margulis–Russo Formula, 216, 231, 291 
martingale 
Doob, 229 
martingale difference sequence, 229, 275 
Max-2-Lin, 383 
Max-3-Coloring, 174, 175 
Max-3-Lin, 174, 179, see also 
Dictator-vs.-No-Notables test for 
Max-E3-Lin 
Håstad’s hardness for, 180 
Max-3-Sat, 174, 180, 188 
Håstad’s hardness for, 180 
Max-CSP(Ψ), 174-177 
Max-Cut, 174, 179, 366-370 
Max-ψ, 175 
May’s Theorem, 28 
mean, 9, 135 
Mehler transform, see Gaussian noise operator 
Minkowski content, see Gaussian Minkowski 
content 
mod 3 function, 17, 156 
mollification, 357, 384-385 
monotone 
DNF, 94 
monotone function, 28 
learning, 67, 269 
monotone graph property, see graph property, 
monotone 
multi-index, 200 
multilinear polynomial, 2 


n-cube, see cube, Hamming 
NAE Test, 164 
noise operator, 39 
applied to individual coordinates, 298 
Gaussian, see Gaussian noise operator 
product space domains, 205 
noise sensitivity, 38, 369 
Gaussian, see rotation sensitivity 
vs. total influence, 119 
Noise Sensitivity Test, 369 
noise stability, 37-40 
product space domains, 205 
uniform, see uniformly noise-stable 
noisy hypercube graph, 248, 270 
noisy influence, see stable influence 
norm, 6 
normal random variable, see Gaussian random 
variable 
not-all-equal (NAE) function, 17, 42 
notable coordinates, 41, 133, 181 
NP-hard, 178, 188 
number operator, see Ornstein–Uhlenbeck 
operator 
optimum value (CSP), 176 

OR function, 27, 302 

Ornstein–Uhlenbeck operator, 332, 339 

Ornstein–Uhlenbeck semigroup, see Gaussian 
noise operator 

orthogonal complement, see perpendicular 
subspace 

orthogonal decomposition, 207-211, 237 

orthonormal, 7, 199 

OS Inequality, 224, 269 

OSSS Inequality, 224, 236, 364 

OXR function, 17, 192 


(p, 2)-hypercontractivity, see 
hypercontractivity 
p-biased Fourier analysis, see biased Fourier 
analysis 
Paley–Zygmund inequality, 242 
parity, 5, 93, 95, 96, 136 
parity decision tree, 74 
Parseval’s Theorem, 8, 202, 338 
complex case, 233 
PCP Theorem, 173, 179 
PCPP, 168-172 
PCPP reduction, 172-173 
Peres’s Theorem, 118, 265 
perpendicular subspace, 57 
pivotal, 29, 46, 231 
Plancherel’s Theorem, 8, 202, 338 
complex case, 219, 233 
Poincaré Inequality, 36, 262, 319 
Poisson summation formula, 63 
polarization, 50, 272 
polynomial threshold function, 101-102, 265 
degree, 124 
Fourier spectrum, 102-103 
noise stability, 121, 128 
sparsity, 102, 103, 124 
total influence, 121-122, 128 
predicates (CSP), 174 
probabilistically checkable proof of proximity, 
see PCPP 
probability density, 12 
ε-biased, 132, 141-142 
ε-biased density, 146 
product basis, 199, 337 
product probability space, 197 
product space domains, 197-211 
projection 
low-degree, 267, 296-298 
projection onto coordinates, 74, 202 
property testing, see testing 
local tester, 163, 168 
pseudo-junta, 304, 321 
PTF, see polynomial threshold function 


Rademacher functions, 24 

random function, 19, 46, 75, 123, 124, 131, 
153 

random graph, 215, 322 

random subset, 84 

randomization/symmetrization, 284-286, 
293-301, 305, 313, 314, 323 

randomized assignment, 189 

reasonable random variable, 241, 284, 351, 
360 

recursive majority, 27, 223, 235 
regular, see ε-regular 
relevant coordinate, 30 
Replacement Method, 352, 361 
resilient, 136, 160 
restriction, 59-62 
Fourier, 61 
random, 84-86 
to subspaces, 62-63 
revealment, 223, 235, 236 
Reverse Hypercontractivity Theorem, see 
Hypercontractivity Theorem, Reverse 
Reverse Small-Set Expansion Theorem, see 
Small-Set Expansion Theorem, Reverse 
ρ-correlated Gaussians, see correlated 
Gaussians 
ρ-correlated strings, see correlated strings 
ρ-stable hypercube graph, see noisy hypercube 
graph 
rotation sensitivity, 341, 390 
subadditivity, 342, 346 
Russo–Margulis Formula, see Margulis–Russo 
Formula 


satisfiable, 176 
SDP, see semidefinite programming 
second moment method, see Paley–Zygmund 
inequality 
selection function, 17 
semidefinite programming, 367 
semigroup property, 48, 269, 329, 373 
sensitivity, 33 
set system, 1 
Shapley value, 232 
Shapley–Shubik index, see Shapley value 
sharp threshold, see threshold, sharp 
Sheppard’s Formula, 107, 330 
shifting, see polarization 
Siegenthaler’s Theorem, 138-139, 145, 160 
small stable influences, see (ε, δ)-small stable 
influences 
Small-Set Expansion Theorem, 258, 270 
generalized, 280, 323, 378 
product space domains, 289 
Reverse, 280, 311, 313, 323 
social choice, 26 
social choice function, 26 
sortedness function, 17 
sparsity (fractional), 72 
spectral concentration, see concentration, 
spectral 
spectral norm, see Fourier norm 
spectral sparsity, see Fourier sparsity 
stable influence, 41, 133, 249, 259 
product space domains, 206, 289 
Stirling’s Formula, 47 
string, 1 
subcube, 58 
degree-1 weight, 112 
subcube partition, 74 
subspaces, 57 
Switching Lemma 
Baby, 87, 97 
Håstad’s, 87, 90-92 
symmetric function, 28 
symmetric random variable, 284 


Tρ, see noise operator 
tensorization, see hypercontractivity, induction 
term (DNF), 79 
test functions, 353 
Lipschitz, 357 
testing, 14, 162-164 
dictatorship, 164 
linearity, 15 
threshold function, see linear threshold 
function 
threshold phenomena, 215 
threshold, sharp, 217, 218, 231, 291-293, 301, 
303, 322 
threshold-of-parities circuit, 102, 103, 123, 
124 
total influence, 32-36 
DNF formulas, 81, 86, 96, 231 
monotone functions, 34 
product space domains, 204, 301 
total variation distance, 21 
transitive-symmetric function, 28, 49, 215, 
234, 291 
decision tree complexity, 224 
tribes function, 28, 46, 53, 82-84, 95, 260 
Two-Point Inequality, 281 
Reverse, 312 


Uρ, see Gaussian noise operator 
UG-hardness, 182, 192, 366 

unate, 46, 120 

unbiased, 9 

uncertainty principle, 73 

uniform distribution, 7 

uniform distribution on A, 12 
uniformly noise-stable, 118, 265, 359 
Unique-Games, 182, 191, 195, 366 
value (CSP), 176 
variance, 9 
Viola’s Theorem, 150 
voting rule, see social choice function 


Walsh functions, 24 
Walsh–Hadamard Matrix, 20 
weight, see Fourier weight 
weighted majority, see linear threshold 
function 


XOR, see parity 


Yao’s Conjecture, 224 


